Choosing the Right Statistical Test

Ray Woodcock

March 11, 2013

I commenced the following manuscript in summer 2012, in an attempt to work through and

articulate basic matters in statistics. I hoped that this would clarify matters for me and for others

who struggle with introductory statistics. I did not know whether I would have time to complete

the manuscript, and in fact I have not had time for that, and am not certain when I will. I am

therefore providing it in this imperfect and incomplete form. The manuscript begins here.

Preliminary Clarification

It seemed that some preliminary remarks might be advisable. First, regarding the entire

enterprise of statistics, I had repeatedly encountered warnings that statistics is as much an art as a

science, and that even experienced statisticians could make decisions that would prove

unwarranted. Among beginners, of course, mistakes would be relatively obvious; there were

many ways to do things visibly wrong. Statistics could be a fairly black-and-white affair at the

basic level. But in some situations, it seemed, one might not need to go far off the beaten track

to encounter disagreements and misunderstandings even among relatively well-trained users of

statistics.

As a second preliminary observation, it seemed appropriate to observe that what was called

“statistics” in introductory and intermediate courses in statistics tended to represent only a subset

of the larger universe of statistical activity. Statistics books commonly referred to a distinction

between theoretical and applied statistics, though this distinction did not appear to be rigidly and

universally adopted. For example, the University of California – Davis offered divergent

undergraduate foci in general statistics, applied statistics, and computational statistics, while

Bernstein and Bernstein (1999, p. iii) characterized general statistics (with no mention of applied

statistics) as

an interpretative discipline . . . in which the presentation is greatly simplified and often

nonmathematical. . . . [from which] each specialized field (e.g., agriculture, anthropology,

biology, economics, engineering, psychology, sociology) takes material that is

appropriate for its own numerical data.

It appeared reasonable, for present purposes, to treat theoretical statistics as a predominantly

philosophical and mathematical enterprise; to treat applied statistics as the adaptation of

theoretical statistics to various research situations, covering the real-world half (or so) of

statistical research (as in e.g., the Annals of Applied Statistics); and to treat general statistics as a

catch-all reference to the broad variety of statistical work, typically but not necessarily having an

applied orientation (as at e.g., the Washington University Center for Applied Statistics). My

experience with statistics was largely in the applied rather than theoretical area, and this was the

zone in which I had encountered the variety of statistical tools mentioned at the outset.

Third, within applied statistics courses and texts, there seemed to be a widely accepted

distinction between descriptive and inferential statistics. It appeared, however, that the actual

difference might vary somewhat from its depiction. Descriptive statistics seemed to be construed

as a catch-all category of noninferential tools, including both the numerical (e.g., standard

deviations) and the non-numerical (e.g., graphs) (e.g., Wikipedia; Gravetter and Wallnau, 2000,

p. 1). This approach seemed to raise several problems:

• In ordinary usage, this approach to descriptive statistics had the potential to privilege

technical devices (e.g., numbers, graphs) beyond their actual significance. A table, the

basis for a potential graph, was generally just a convenient way of stating numbers that

could instead have been recited in a paragraph. In other words, if the concept of

descriptive statistics included graphs, it would also have to include text.

• Descriptive statistics, construed as just the textual or other description of statistical data,

could not really be segregated into a few chapters at the start of a textbook. It might

make sense to have one or more chapters on standard deviations, graphs, and so forth.

But whereas a chapter on ANOVA might say all that an author would intend to say on

that subject, a chapter on graphs or standard deviations would be unlikely to capture the

role played by graphs and standard deviations. To the contrary, graphs and standard

deviations – and descriptive text – invariably pervaded statistical articles and books. One

could have a statistics work without any inferential statistics; but one could not have a

statistics work without descriptive content. Indeed, one could not even have inferential

statistics without descriptive content: the understanding and use of inferential statistics

were heavily dependent upon standard deviations, graphs, and descriptive text, especially

but not only in applied contexts.

• Inferential statistics could be mischaracterized as something more than a description of

data, when in fact it was something less than a description of data. Description was

necessarily based upon observation, whereas inference tended to be a derivative matter of

speculation based upon description. Such speculation could guide the interpretation of

observation – suggesting, for example, observations that might deserve more or less

attention – but speculation could not replace or do without the underlying descriptive

substance.

In short, it seemed that the applied statistician working with inferential material would be

especially concerned with identifying frontiers beyond which inference, ultimately expressed and

interpreted in largely textual terms, would be relatively unsupported mathematically.

Metaphorically, my interest in statistical tools was particularly oriented toward the beach of

application, on an island of inferential statistics, within a descriptive, noninferential ocean that

would sustain and also limit inference.

A fourth preliminary remark had to do with the relationship between inference and probability.

The foregoing discussion seemed to suggest that Wikibooks (2011, p. 3) was overly simplistic in

its assertion that “Statistics, in short, is the study of data.” Even cockroaches study data.

Statistics could not be purely mathematical, but it did have undeniable roots in mathematics.

Both inference and probability were common topics in statistics; both were mathematical; the

question at hand was, how were these mathematical topics related? Wolfram MathWorld

seemed mistaken in suggesting that “The analysis of events governed by probability is called

statistics” because – again, as hinted above – statistics included a great deal of nonprobabilistic

content. A better version, built from the preceding paragraph, was that applied inferential

statistics was primarily concerned with determining the degree of probabilistic support for

inferences from available information. This seemed akin to the suggestions, by Kuzma and

Bohnenblust (2004, pp. 3, 71), that “Inferential statistics is concerned with reaching conclusions

from incomplete information – that is, generalizing from the specific” but that “Probability is the

ratio of the number of ways the specified event can occur to the total number of equally likely

events that can occur.” In other words, it seemed that probability could be treated as the highly

abstract, mathematical engine used in a larger and potentially messier inferential statistics

project.

Given these observations and conclusions, it appeared that this post’s goal of describing the

selection of a particular statistical tool was, more precisely, a goal of identifying the optimal

configuration of the probability engine for purposes of a particular task. Within the grand

ambiance of numbers, language, and graphics, there was a narrow question of whether

probability would support a certain inference from specified data. The study of probability had

developed better and worse ways to answer that question. At this relatively basic level, there

was apparently some consensus among statisticians as to which ways were better. The task of

this post, then, was to describe the indicators that would guide a statistician’s choice of

probabilistic tools for purposes of statistical inference.

The Importance of Data Type

Numerous sources seemed to agree that the type of data would significantly influence the choice

of statistical test at an early stage in the decision process. Stevens (1946, p. 678) posited four

levels of data measurement. Those four, presented in the reverse of their usual order, were

commonly understood as follows:

• Ratio. Data measured on a ratio scale were what one might consider ordinary numbers:

they could be added, subtracted, multiplied, and divided meaningfully. The defining

characteristic of ratio data was the existence of a non-arbitrary zero value. For instance,

one could count down, to zero, the number of seconds until an event. The existence of a

real zero also meant that values could be reported as fractions or multiples of one another

(i.e., in terms of ratios). For example, three pounds of a substance was half as much as

six pounds; two inches on a ruler was half as much as four inches. All statistical

measures could be used with ratio data. Thus, researchers generally considered ratio data

ideal.

• Interval. Data measured on an interval scale lacked a true zero; zero was just one more

interval. For example, the Fahrenheit and Celsius (as distinct from Kelvin) temperature

scales used arbitrary zero values that did not indicate an actual absence of the thing being

measured. Therefore, multiplication and division were not directly meaningful; 33

degrees (F or C) was not half as warm as 66 degrees. Without division, some statistical

tools would be unavailable for interval data. As with ratio-level data, the existence of

uniform increments did provide consistency (for example, April 2 and 5 were as far apart

as November 8 and 11) and the possibility of subdividing the scale into a continuous series of

ever-smaller increments (e.g., 1.113 degrees would be more than 1.112 degrees). The barrier

to meaningful division of interval values would not prevent the calculation of ratios of

differences between values.

• Ordinal. Data measured on an ordinal scale were not really mathematical. Numbers

were used just to show the rank order of entries, not the number of increments or the

difference between some value and true zero. For instance, a racer who came in fourth

was not necessarily half as fast as someone who came in second. A letter grading system

(A, B, C, D, F) was ordinal; so were IQ scores and Likert scale data (e.g., 1 = strongly

disagree, 5 = strongly agree). In such examples, the differences between the first and

second items on the scale were not necessarily equal to the differences between, say, the

fourth and fifth items. Values were discrete rather than continuous; for instance, nobody

in a race would finish in position 2.7. These limits would further restrict the kinds of

statistical tools that could be used on ordinal data.

• Nominal. Data on a nominal scale could not be measured, in terms of the quality being

studied. They were simply sorted into categories (and could thus be counted). The

categories had no intrinsic arrangement. For example, unlike coming in first rather than

second in a race, being an apple was not, in itself, better than being an orange (though of

course there might be measurable differences in the chemical properties of pieces of fruit

extracted from apples and oranges). Unlike IQ scores or letter grades, the concept of

“house” did not automatically establish a rank order among ranch, Cape Cod, and

colonial types of houses; sorting would have to be arbitrary (by e.g., age or alphabetical

order). Few statistical tools were available for nominal data.

While the distinctions among these four data types would be important eventually, some decision

trees made a first distinction between the qualitative (also called categorical) data found in

ordinal and nominal scales and the quantitative (also called numerical) data found in ratio and

interval scales. This post thus proceeds with an introductory look at qualitative data analysis.

Nonparametric Analysis: Statistical Tools for Qualitative Data

The study of statistical methods seemed to make a fundamental distinction between parametric

and nonparametric analysis of data. Parameters were mathematical characteristics of a

population – of, that is, a potentially large class of entities (e.g., cigarette smokers, beans in a

bag, cars in New York City). Means and standard deviations seemed to be the two parameters

most commonly mentioned in inferential statistics. The mean was the preferred measure of

central tendency – of where the bulk of the data tended to be located. The standard deviation

was the preferred measure of dispersion – of how much the individual values varied from the

mean. Together, the mean and standard deviation would give an approximate, concise sense of

how individual values tended to be distributed. Since means and standard deviations could not

be calculated for ordinal or nominal data, parametric methods were limited to ratio and interval

data.

Parametric methods would require certain assumptions about the population being studied. Such

assumptions could yield powerful and accurate estimates of population parameters, if justified,

but considerable error if unjustified. By contrast, nonparametric methods would depend upon

fewer assumptions, but would not support comparably powerful inferences, and therefore tended

to be treated as less desirable alternatives to parametric methods. But when testing samples

drawn from a population, all parametric and nonparametric tests depended on the crucial

assumption that the data contained in those samples were obtained through random selection

from the relevant population (Pett, 1997, p. 9).

It could seem practical for a statistics textbook to begin with a discussion of nonparametric

methods. Such methods could be more forgiving and more generally applicable across a variety

of quantitative as well as qualitative contexts. In addition, given the relatively slight attention

customarily provided to nonparametric methods, it could be appropriate to get this topic out of

the way, so as to focus on parametric methods.

On the other hand, there were also good reasons to begin with parametric methods. One such

reason was that there were numerous nonparametric methods. For example, in a book that did

not purport to be comprehensive, Pett (1997) discussed approximately 20 such methods. It

appeared that parametric methods were to be considered the standard, whereas one would go

searching among the nonparametric options primarily to meet an atypical need. An attempt to

begin by mastering the nonparametric methods could thus distract or confuse someone who was

only starting to learn about statistical methods.

In short, it appeared that the distinction of parametric versus nonparametric statistics at this point

served primarily to draw attention to the existence of such a distinction, and to encourage efforts

to obtain data amenable to parametric analysis when possible. This discussion therefore turns to

parametric analysis, leaving nonparametric methods as the best approach when dealing with

ordinal or nominal data, or when parametric assumptions were violated and could not be

counteracted.

Introduction to Parametric Analysis

This post begins with the observation that statisticians used many tools or methods to analyze

data. This observation arose from exposure to introductory statistics textbooks. Such textbooks

commonly offered chapters on descriptive statistics and other matters that might aid in

orientation to, or application of, statistical reasoning. But the central emphasis of such textbooks

typically seemed to be upon statistical inference – upon, that is, the process of making inferences

about populations (especially about their means) based on information derived from samples

drawn from those populations. The analysis of samples was a matter of convenience and

feasibility: studying a sample could be much easier and less expensive than obtaining complete

raw data from an entire population; and if it was a good sample, it could be expected to yield

results representative of the population as a whole.

Customarily, the process of inference would be focused on the testing of a research hypothesis.

The researcher would hypothesize that one sample differed from another. This hypothesis would

be an alternative to the “null” hypothesis, which would state that there was, in fact, no

statistically significant difference between the two. The inferential tool selected for the task

would be the one that could best inform the question of whether there was sufficient evidence to

reject the null hypothesis. In such comparisons, the independent variable would be the thing

being put into comparison groups or otherwise manipulated (e.g., smoker or nonsmoker) to

produce effects upon the dependent variable (e.g., cancer rates).

The question of statistically significant difference was not necessarily the same as a

scientifically, practically, or clinically significant difference. The means of two samples could

be different enough from one another that one could infer, with a fair degree of confidence, that

they came from statistically divergent populations – that, for example, people who received a

certain treatment tended to display a relevant difference from people who did not receive that

treatment. Yet this mathematical difference might not get past the concern that the difference

was so minor (given e.g., the cost or difficulty of the treatment) as to be insignificant in practical

terms. One could perhaps devise separate statistical studies to examine such practical issues, but

practicality was not itself a part of statistical significance.

Key Parametric Assumptions

The process of using a parametric tool could begin with preliminary tests of assumptions. If a

dataset was not compatible with a given tool’s assumptions, that tool might not provide accurate

analysis of the data. Not every tool required satisfaction of every assumption; that is, parametric

tools varied in robustness (i.e., in their ability to tolerate variation from the ideal).

A source at Northwestern University and others suggested that, in addition to the assumptions of

interval or ratio scale data and random sampling (above), the list of parametric assumptions

included independence, an absence of outliers, normality, and equal population or sample

variances (also known as homoscedasticity). These assumptions could be summarized briefly:

• Independence meant that the data points should be independent of other data points, both

within a data set and between data sets. There were exceptions, such as where a tool was

designed specifically for use with datasets that were paired (as in e.g., a comparison of

pre- and post-intervention measurements) or otherwise matched to or dependent on one

another.

• Outliers were anomalous data points – significant departures from the bulk of the data.

Renze (n.d.) and others suggested a rule of thumb by which a data point was deemed an

outlier if it fell more than 1.5 times the interquartile range above or below that range.

Seaman and Allen (2010) warned, however, that one must distinguish outliers from long

tails. Unlike a single outlier (or a small number of odd variations among a much larger

set of relatively consistent data points), a series of extreme values was likely to be a

legitimate part of the data and, as such, should not be removed. Outliers resulting from

errors in data collection seemed to be the best candidates for removal via e.g., a trimmed

mean. For example, a 10% trimmed mean would be calculated on the basis of all data

points except those in the top 5% and the bottom 5% of the data points. Winsorizing

could be used to replace the trimmed lower and upper values with repetitions of the lowest and

highest values, respectively, that remained after such trimming. (See Erceg-Hurn & Morsevich,

2008, pp. 595-596.)

• A normal (also known as Gaussian) distribution was symmetrical, with data evenly

distributed above and below the mean, and with most data points being found near the

mean and progressively fewer as one moved further away. Wikipedia and others

suggested that one could test normality, first, with graphical methods, notably a

histogram, possibly a stem-and-leaf plot, and/or a Q-Q Plot. The first two would be used

to see whether data appeared to be arranged along a normal curve. There were two

options with the Q-Q Plot: one could use it to compare a sample against a truly normal

distribution or (as noted by a source at the National Institute of Standards and

Technology) against another sample. Either way, significant departures from a straight

line would tend to indicate that the two did not come from the same population. Another

test for normality used the 68-95-99.7 Rule – that is, the rule that, in a normal

distribution, 99.7% of data points (i.e., 299 out of every 300) fell within three standard

deviations (s) of the mean. So if a value in a small sample fell more than 3s from the

mean, it probably indicated a nonnormal distribution. It was also possible to use the

SKEW and KURT functions in Microsoft Excel to get a sense of normality, for up to 30

values. (Excel’s NORMDIST function would return the density or cumulative probability of a

normal distribution for a given value, mean, and standard deviation.) A negative SKEW value

indicated left skew, and a negative KURT value indicated a flat (platykurtic) distribution;

positive values indicated right skew and a peaked (leptokurtic) distribution; values near zero

were consistent with normality. Finally, there were more advanced tests for normality. These tests were built

into statistical software; some could also be added to Excel. Among these tests, Park

(2008, p. 8) and others indicated that the Shapiro-Wilk and Kolmogorov-Smirnov tests

were especially often used, with the former usually being preferred in samples of less

than 1,000 to 2,000. In addition to professional (sometimes expensive) Excel add-ons,

Mohammed Ovais offered a free Excel spreadsheet to calculate Shapiro-Wilk, and Scott

Guth offered one that reportedly used Kolmogorov-Smirnov. On the other hand, Erceg-

Hurn and Morsevich (2008, p. 594) cited “prominent statisticians” for the view that such

tests (naming Kolmogorov-Smirnov as well as Levene’s (below)) were “fatally flawed” and

should “never be used.”

• Homoscedasticity (Greek for “having the same scatter”), also known as homogeneity of

variance, meant that the populations being compared had similar standard deviations.

McDonald (2009) appeared to indicate that a lack of such homogeneity (i.e.,

heteroscedasticity) would tend to make the two populations appear different when, in

fact, they were not. Beyond the intuitive eyeballing of standard deviations and dispersion

of values, graphs (e.g., boxplots) could aid in detection of heteroscedasticity. De Muth

(2006, p. 174) alleged a rule of thumb by which one could assume homogeneity if the

largest variance, among samples being compared, was less than twice the size of the

smallest. (In an alternate version of that rule, Howell (1997, p. 321) proposed 4x rather

than 2x.) Formal tests of homoscedasticity included the F test, Levene’s test, and

Bartlett’s test. Subject to the warning of Erceg-Hurn and Morsevich (2008) (above), the

F test appeared especially common and was available in Excel’s FTEST function.

According to that Northwestern site and others, however, the F test required a high

degree of normality. McDonald indicated that Bartlett’s was more powerful than

Levene’s if the data were approximately normal, and that Levene’s was less sensitive to

nonnormality. McDonald offered a spreadsheet that performed Bartlett’s test; Winner

offered one for Levene’s.
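
By way of illustration of the outlier, normality, and homoscedasticity checks described in the bullets above, the following Python sketch (the sample data, group sizes, and trimming percentages are invented for the example) applies the 1.5 × IQR rule, a 10% trimmed mean, winsorizing, skewness and kurtosis (roughly analogous to Excel’s SKEW and KURT), the Shapiro-Wilk test, Levene’s and Bartlett’s tests, and the variance-ratio rule of thumb.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)    # invented sample data
group_b = rng.normal(loc=53, scale=9, size=40)

# Outliers: the 1.5 x IQR rule of thumb
q1, q3 = np.percentile(group_a, [25, 75])
iqr = q3 - q1
outliers = group_a[(group_a < q1 - 1.5 * iqr) | (group_a > q3 + 1.5 * iqr)]

# Trimming and winsorizing (10% in total: 5% from each tail)
trimmed_mean = stats.trim_mean(group_a, proportiontocut=0.05)
winsorized = stats.mstats.winsorize(group_a, limits=(0.05, 0.05))
print(winsorized.min(), winsorized.max())   # extremes now equal the retained boundary values

# Skewness and excess kurtosis (analogous to Excel's SKEW and KURT)
print(stats.skew(group_a), stats.kurtosis(group_a))

# Normality: Shapiro-Wilk, often preferred for smaller samples
sw_stat, sw_p = stats.shapiro(group_a)

# Homoscedasticity: Levene's test (less sensitive to nonnormality) and Bartlett's test
lev_stat, lev_p = stats.levene(group_a, group_b)
bart_stat, bart_p = stats.bartlett(group_a, group_b)

# Rule of thumb: largest variance less than 2x (or, per Howell, 4x) the smallest
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
variance_ratio = max(var_a, var_b) / min(var_a, var_b)

print(len(outliers), round(trimmed_mean, 2))
print(round(sw_p, 3), round(lev_p, 3), round(bart_p, 3), round(variance_ratio, 2))

A small p value from any of these tests would suggest a departure from the corresponding assumption, subject to the cautions quoted above about relying on such tests.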

Violation of these assumptions would apparently leave several options: the researcher could

ignore the violation and proceed with a parametric test, if the test was robust with respect to the

violation in question; the researcher could choose a nonparametric test that did not make the

relevant parametric assumption(s); or the researcher could transform the data to render it

amenable to parametric analysis. Here, again, Erceg-Hurn and Morsevich (2008, pp. 594-595)

criticized transformations on numerous grounds (e.g., the transformation may not solve the

problem; data interpretation is difficult because subsequent calculations are based on

transformed rather than original data, accord Northwestern) and also suggested that classic

nonparametric methods have been superseded. The following discussion begins with the first of

those options: a closer inspection of parametric tests.

Parametric Analysis Decision Trees

As noted above, the type of data being collected (above) would significantly affect the choice of

statistical test. Wadsworth (2005) provided decision trees for nominal, ordinal, and “scale” data,

defining the last as including “dependent variables measured on approximately interval, interval,

or ratio scales.” A source in the psychology department at Emory University defined

“approximately” (also sometimes called quasi-) interval scales as

scales that are created from a series of likert rating [sic] (ordinal scale of measurement)

by either adding up the individual responses or calculating an average rating across items.

Technically, these scales are measured at the ordinal scale of measurement but many

researchers treat them as interval scales.

Wadsworth identified eight parametric options for scale data: four for cases involving only one

sample, two for cases involving two samples, and two for cases involving more than two

samples. Wadsworth also identified nine nonparametric options for nominal and ordinal data.

Other sources (e.g., Horn, n.d.; Garner, n.d.; SFSU, n.d.; Gardener, n.d.) offered alternate

decision trees or lists. Generally, it seemed that these criteria (i.e., type of data, number of

samples) would be among the first questions asked (along with whether the study was one of

correlation or difference among samples). It also appeared that various decision trees – as well

as the tables of contents of various statistics texts (e.g., Gravetter & Wallnau, 2000; Kuzma &

Bohnenblust, 2004) – tended to point toward substantially the same list of primary parametric

tests. Yet this raised a different question. Ng (2008) and Erceg-Hurn and Morsevich (2008)

criticized the use of the old, classic tests appearing in such lists. In the words of Wilcox (2002),

To put it simply, all of the hypothesis testing methods taught in a typical introductory

statistics course, and routinely used by applied researchers, are obsolete; there are no

exceptions. Hundreds of journal articles and several books point this out, and no

published paper has given a counter argument as to why we should continue to be

satisfied with standard statistical techniques. These standard methods include Student's T

for means, Student's T for making inferences about Pearson's correlation, and the

ANOVA F, among others.

If that was so, there was a question as to why one would begin with a t test – or with any of the

others in the classic lists, for that matter. One answer, suggested by the preceding paragraphs,

was that one might be limited by what one tended to find in classrooms, in online discussions,

and in textbooks (but see texts cited by Erceg-Hurn & Morsevich, 2008, pp. 596-597, e.g.,

Wilcox, 2003). A related answer was that the classic methods had been researched and explored

extensively, and had an advantage in that regard for conservatively minded researchers: such

methods might be obsolete in the sense of being relatively weak or limited, but were

presumably not terribly obsolete in the sense of being wrong. Another answer was that, at least

at the learning stage, it behooved one to learn what everyone else was using, so as to be

employable and basically conversant in one’s field – and thereafter inertia would take over and

people would tend to stay with that much because, after all, most were not statisticians. That

answer suggested yet another: that the objection that people were using inept old methods was

akin to the objection that, in many instances, people were using those methods poorly.

It did seem that consumers as distinct from producers of research (e.g., the people taking

introductory statistics courses, as distinct from some of their instructors) would need to respect

the old methods for the foreseeable future. On this impression, it seemed appropriate, here, to

follow the common approach of starting with z and t scores.

z Scores

The z score for a datum (x) was calculated with a fairly simple formula: z = (x – µ) / σ. The idea

was to see how many times the standard deviation (σ) would go into the difference between the

data point and the mean. That is, how far away from the mean was this data point, as measured

in terms of standard deviations?

The point of the z score was simply to convert raw data into standardized data. In a z

standardization, raw values were converted to z scores, where the mean was always zero and the

standard deviation was always 1.0. This had practical value. Ordinarily, for instance, it would

not be clear how one should interpret a score of 77 on an exam, or a fruit non-spoilage rate of

63%. But the information that such scores were two standard deviations below the mean would

indicate that these values were fairly unusual.

Even then, however, the situation would not be as clear as one might like. The mean score (x̄)

on that exam could have been an 87, with a population standard deviation (σ) of 5, or it could

have been 79, with a standard deviation of 1. In the latter case, a student with a 77 would

probably be pretty much with the group; not so in the former. Yet even a 77 (x̄ = 87, σ = 5)

could have a shred of respect in a skewed distribution – if, say, some scored as low as 30 or 40.

Those sorts of uncertainties would be resolved in the case of a normal (i.e., 68-95-99.7 (see

above)) distribution. Normal distributions were the place where z scores worked. If one knew

that grades on the exam were normally distributed, it would be possible to calculate the precise

percentage of grades falling more than two standard deviations below the mean (to calculate, that

is, the “percentile” of a 77) without access to the list of grades. A z table or calculation could be

found at the back of a statistics text and in some calculators and computer programs. As

distributions became less normal, however, z scores would provide less accurate estimates

(Ryan, 2007, p. 93).

According to the z table, in a normal distribution, 47.72% of all scores fell between the mean

and a raw score precisely two standard deviations below it. That is, about 95.4% of all raw scores would

be within two standard deviations above or below the mean: about 2.3% of scores would be

more than two standard deviations above the mean, and another 2.3% would be more than two

standard deviations below the mean. So if the grades on this exam were normally distributed, a

77 would be (rounding upwards) at the 3rd percentile: 2.3% of scores would be lower than 77.

This ability to calculate percentages without knowing specific scores faded as a distribution

became less than perfectly normal. But it was still possible to calculate percentiles, at least, with

a nonnormal distribution. Doing so required ranking the scores and finding the desired point, as

in the calculation of a median (i.e., the 50th percentile). For example, the 20th percentile

occurred at the value where 20% of all scores were lower.

The example just given involved a data point two standard deviations from the mean. What if

the score had been 78 instead of 77? This would require calculation of the z score, using the

formula noted above (i.e., z = (x – µ)/σ). In a normal distribution, a raw score of 77, two standard

deviations below the mean of 87, would have a z score of –2.0. A raw score of 78 would be one and

four-fifths of a standard deviation below the mean; thus the calculation would give z = –1.8. In a z

table (which lists magnitudes of z), a value of 1.8 indicated that 46.41% of all scores fell between

that score (inclusive) and the mean.

Of course, another 50% of the scores lay on the other side of the mean, above 87. So 96.41% of

scores would be at or above 78, leaving 3.59% below 78. So a 78 would be at about the 4th

percentile. In short, by converting raw scores into standardized scores, z scores made it possible

to compare the relative standing of raw scores in dissimilar contexts (comparing e.g., a score of

77 out of 100 on one test against a score of 35 out of 45 on another test) – as long as the

distribution was normal.
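
The arithmetic of the exam example above can be reproduced in a few lines of Python; the scores of 77 and 78, the mean of 87, and σ = 5 come from the example, with scipy’s normal distribution standing in for the printed z table.

from scipy.stats import norm

mu, sigma = 87, 5                          # mean and standard deviation from the exam example

for x in (77, 78):
    z = (x - mu) / sigma                   # z = (x - mu) / sigma
    pct_below = norm.cdf(z) * 100          # percentage of scores below x in a normal distribution
    print(f"x = {x}: z = {z:.1f}, about {pct_below:.2f}% of scores below")

# x = 77: z = -2.0, about 2.28% below (roughly the 2nd-3rd percentile)
# x = 78: z = -1.8, about 3.59% below (roughly the 4th percentile)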

The Central Limit Theorem

and the Distribution of Sample Means

Researchers did not typically seek and derive conclusions from a single raw score. They were

more likely to collect a sample of scores and calculate a mean. In this way – especially if they

selected a random sample – their work would tend to reflect different experiences or

observations. Their conclusions were thus less at risk of veering off into a special case of no

particular importance.

Just as a single raw value could be compared against the population containing all raw values, a

single mean value could be compared against the population of mean values. More precisely,

just as the raw value could be compared against the mean of raw values, a mean could be

compared against the mean of means. The latter comparison would be made specifically against

the mean of means of all possible samples of the same size. From a population of 1,000 raw

values, for instance, one could draw a very large but not infinitely large number of samples

containing 30 raw values each, calculate the mean of each such sample, and compare the mean of

just one of those samples against the mean of the means of all samples of that size.

Moreover, just as raw values might be distributed normally, so also the distribution of those

sample means (also known as the “sampling distribution of the mean”) might be normal. In fact,

the distribution of sample means approached normality as the sample size increased, regardless

of the shape of the underlying population. Moderate sample sizes generally sufficed (n ≥ 30 – or

less, in the case of a normal population – up to n ≥ 500 or more, for a very nonnormal population

distribution; e.g., Chang, Huang, & Wu, 2006). In other words, the distribution of sample means

could be approximately normal even if the underlying raw data distribution was not. (Drawing a

larger number of samples did not change this theoretical distribution; it simply made the observed

collection of sample means resemble it more closely.)

That principle of increasing normality was known as the Central Limit Theorem, first published

by de Moivre in 1733. The Central Limit Theorem yielded some related insights. One was that,

not surprisingly, the mean of the distribution of all possible sample means of a given size (µ_x̄)

was equal to the mean of the underlying raw values in the population (i.e., µ_x̄ = µ). The other,

less obvious, was that the standard deviation of the distribution of sample means (called the

“standard error of the mean,” σ_x̄ = σ ÷ √n) was much less than the standard deviation of the underlying

distribution of raw values. That is, a sample mean would tend to be closer to the population

mean than a raw value would be. That was because samples would tend to be moderated by raw

values from both above and below the mean: an extreme sample mean was much less likely than

an extreme single data point. This moderating tendency increased with sample sizes until, of

course, a sample as large as the whole population would have a mean equal to that of the

population. So a single large sample might provide a fair “point estimate” of an unknown

population mean.

Graphically, the distribution of sample means might look leptokurtic rather than normal, when

presented on the same scale as the raw data of the population. The fact that it was normal meant

that, regardless of the distribution of raw values in the population, the researcher could use the z

table and the standard error (σ_x̄), relying on the normality of this distribution of sample means, to

calculate the distance of his/her sample mean from a known population mean, or to estimate its

distance from an unknown population mean.
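
A brief simulation can make the foregoing concrete; the population shape, sample size, and number of samples below are arbitrary choices for illustration, not anything specified in the sources above.

import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=10, size=100_000)   # a decidedly nonnormal (skewed) population

n, draws = 30, 5_000
sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(draws)])

print(population.mean(), sample_means.mean())              # mean of sample means ~ population mean
print(population.std() / np.sqrt(n), sample_means.std())   # standard error ~ sigma / sqrt(n)
# A histogram of sample_means would look roughly normal even though the population is skewed.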

The t Distribution

Calculations of z assumed that the population’s standard deviation (σ) was known. This

assumption was usually “unrealistic” (Bernstein, 1999, p. 207). Instead, it was commonly

necessary to estimate σ by substituting the sample’s standard deviation (s). One would thus use s

in place of σ for raw scores, and would calculate s_x̄ = s ÷ √n (in place of σ_x̄ = σ ÷ √n)

when working with the distribution of sample means.

That substitution would alter the equation, since s and σ were not calculated in exactly the same

way. The difference was as follows:

σ = √( Σ(x – µ)² / N )  but  s = √( Σ(x – x̄)² / (n – 1) )

Because of the difference in denominators (i.e., number of items in the population (N) versus

items in the sample (n) minus 1), s would not have the same value as σ even if the sample mean

(x̄) and population mean (µ) were the same. Hence, the normal distribution, and the values that

would work with σ, in the z table, would not work with s. Substituting s for σ would thus require

an alternative to the z distribution. This alternative, the t distribution, was published by Gosset

(using the pen name of “Student”) and refined by Fisher (Eisenhart, 1979, p. 6). Like the normal

distribution, the t distribution (more precisely, the family of t distributions) was essentially a

formula whose results, if graphed, would produce a bell-shaped curve. For relatively large

samples (n ≥ 120) the t distribution was virtually equivalent to the normal distribution (Hinkle,

1994, p. 186). The two were closely similar even at n = 30 (Kuzma & Bohnenblust, 2004, p.

114). But as sample size shrank, the t distribution became more dispersed, reflecting less

confidence in the sample as an indicator of the population from which it was drawn – until, at

small values such as n = 3, the graph of the t distribution was much flatter than the normal curve, with markedly heavier tails.

In the t distribution, the calculation to determine the distance of the sample mean from the

population mean was much like the calculation of the z score (above): t = (x̄ – µ) / s_x̄. As

indicated above, s_x̄ = s ÷ √n. So t = (x̄ – µ) / (s ÷ √n). Just as z scores expressed the distance of

an individual raw value from the mean of raw values in terms of standard deviations, t scores

expressed the distance of an individual sample mean from the mean of all sample means in terms

of standard errors of the mean (s_x̄).
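
As a sketch of these formulas (the sample values and the hypothesized mean of 100 are invented; scipy’s t distribution supplies the critical values), the following compares t and normal critical values at different sample sizes and computes t = (x̄ – µ) / (s ÷ √n) for a small sample.

import numpy as np
from scipy.stats import t, norm

# Two-tailed 95% critical values: t approaches the normal value (~1.96) as n grows
for n in (3, 30, 120, 1000):
    print(n, round(t.ppf(0.975, df=n - 1), 3))
print("normal:", round(norm.ppf(0.975), 3))

# A one-sample t statistic for an invented sample, against a hypothesized mean of 100
sample = np.array([96.0, 101.0, 98.5, 103.0, 97.0, 99.5])
x_bar, s, n = sample.mean(), sample.std(ddof=1), len(sample)
t_stat = (x_bar - 100) / (s / np.sqrt(n))
p_value = 2 * t.cdf(-abs(t_stat), df=n - 1)   # two-tailed p value
print(round(t_stat, 3), round(p_value, 3))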

t Tests

As one would expect from a parametric statistical tool, t tests were used to learn about

populations. This occurred in several different settings. The one-sample t test would be used to

estimate a population mean or to compare a sample against a known population mean (De Muth,

2006, p. 174). The latter use would ask whether the sample appeared likely to be drawn from

that population. For example, the general population of 40-year-old men might have a certain

mean weight, whereas a sample of 40-year-old men who had received a certain treatment might

weigh three pounds less. The question for the t test would be whether that was a statistically

significant difference, such that the sample no longer seemed to represent the general population

of 40-year-old men, but instead seemed likely to represent a new population of treated 40-year-

old men.

While one-sample scenarios were commonly used for educational purposes, “most research

studies require the comparison of two (or more) sets of data” (Gravetter & Wallnau, 2000, p.

311). Two-sample t tests would compare the means of two samples against each other, treating

each as representative of a potentially distinct population, to determine whether their differences

from one another were statistically significant. There were two kinds of two-sample t tests.

Paired (also known as “dependent” or “related”) samples would involve matching of individual

data points. For example, the same person might be tested before and after a treatment.

Unpaired (also known as “independent”) samples would have no such relationship; for example,

the 40-year-old men cited in the previous paragraph would be in two separate (control and

treatment) groups, without any one-to-one correspondence or pairing between individual men.

In that example, the (generally nominal or ordinal) independent variable

would be the treatment – it would lead the inquiry, providing the basis for grouping – and the

thing being studied (e.g., weight) would be the (interval or ratio) dependent variable.
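
A minimal sketch of the three settings just described, using scipy and invented data (the weights, group sizes, and treatment effect are assumptions for illustration; the hypothetical population mean weight is taken as 180 pounds):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# One-sample: does a treated sample appear to come from a population with mean weight 180?
treated = rng.normal(loc=177, scale=12, size=25)
print(stats.ttest_1samp(treated, popmean=180))

# Paired (dependent) samples: the same men weighed before and after a treatment
before = rng.normal(loc=180, scale=12, size=25)
after = before - rng.normal(loc=3, scale=4, size=25)
print(stats.ttest_rel(before, after))

# Unpaired (independent) samples: separate control and treatment groups, no matching
control = rng.normal(loc=180, scale=12, size=30)
print(stats.ttest_ind(treated, control))   # equal_var=False would give Welch's test if variances differ

Each call returns a t statistic and a two-tailed p value; a small p value would be the evidence, discussed above, for rejecting the null hypothesis of no difference.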