
How do I know if I am normal?


ACCIDENTAL NOTE

How do I know if I am normal?

Shrikant I. Bangdiwala a,b*

a Department of Biostatistics, UNC Gillings School of Global Public Health, Chapel Hill, NC, USA; b Institute for Social and Health Sciences, University of South Africa, Johannesburg, South Africa

Introduction

In the injury field, most of the variables of interest are count data, e.g. the number of drownings in a given locale in a given time period, or the number of pedestrian deaths in a given area in a given time period. However, some important variables are measured on a continuous scale, e.g. blood alcohol concentration (BAC) in g/dl or speeds of vehicles in km/hr, and some constructed indexes or scales are treated as continuous variables, e.g. the injury severity score (ISS) (Baker, O’Neill, Haddon, & Long, 1974) or quality of life scales. When analysing a variable measured on a continuous scale, one is interested in describing its distribution: what is its range of values; are some values more common than others; is the shape of the distribution skewed or symmetric, peaked or flat? If the observed distribution fits a particular family of well-characterised theoretical probability distributions, one could use the mathematical properties of that family of distributions to simplify description and further analyses.

A commonly used family of probability distributions is the Gaussian bell-shaped distribution, commonly referred to as the normal distribution, since it was found to ‘normally’ describe the distribution of errors in measurements. It is quite popular because it has useful mathematical properties, because many other probability distributions are approximately bell-shaped, and largely because, regardless of a variable’s original distribution, the distribution of means of samples taken from that distribution follows the Gaussian bell shape in large samples. Given its usefulness and widespread applicability, how should one verify whether the normal distribution fits one’s data? This Accidental Note addresses this question using ISS values for deaths from the Texas EMS Trauma Registry (2004).
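The large-sample behaviour of sample means described above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the exponential "original" distribution, the sample size of 100, the seed and the helper name `share_within_one_sd` are all arbitrary choices, not taken from the note. It compares the share of values within one standard deviation of the mean, which is about 68% for a normal distribution.

```python
import random
from statistics import NormalDist, fmean, pstdev


def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of their mean."""
    mean, sd = fmean(values), pstdev(values)
    return sum(abs(v - mean) <= sd for v in values) / len(values)


rng = random.Random(2014)  # fixed seed so the sketch is reproducible

# A strongly right-skewed "original" distribution (exponential, mean 1); an assumption.
raw = [rng.expovariate(1.0) for _ in range(20_000)]

# Means of 2,000 samples of size 100 drawn from that same distribution.
sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(100)]) for _ in range(2_000)
]

# Share expected under a normal distribution: about 0.683.
normal_share = NormalDist().cdf(1) - NormalDist().cdf(-1)

print(f"normal reference:  {normal_share:.3f}")
print(f"raw values:        {share_within_one_sd(raw):.3f}")
print(f"means of samples:  {share_within_one_sd(sample_means):.3f}")
```

The raw values should depart clearly from the normal reference, while the sample means should sit close to it.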

The family of normal distributions

In 1733, the French mathematician Abraham de Moivre found a good large-sample approximation to the binomial probability distribution, which was extensively used at the time in games of chance. In 1809, the German mathematician Carl Friedrich Gauss developed a two-parameter function based on the mathematical exponential function while studying astronomical observation errors (Stahl, 2006). Its widespread use for describing error distributions led the famous statistician Karl Pearson to call it the normal distribution in 1893. Gauss’ function f(x) is symmetric (so its mean equals its median), and it is fully characterised by two parameters, the mean μ and the standard deviation σ:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

Since it is symmetric, its asymmetry (skewness) is zero, and it is characterised by having a peakedness (kurtosis) of 3. It is often referred to as the bell-shaped distribution, as shown in Figure 1.
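As a minimal sketch of the density function above, the following evaluates f(x) directly and checks its symmetry, its total area and the share of cases within one, two and three standard deviations of the mean. The function name `normal_pdf` and the parameter values are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    """Gaussian density f(x) with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


mu, sigma = 0.0, 1.0  # illustrative parameter values

# Symmetry about the mean: f(mu - d) equals f(mu + d).
print(normal_pdf(mu - 1.5, mu, sigma) == normal_pdf(mu + 1.5, mu, sigma))

# Crude Riemann sum over mu +/- 8 sigma: the total area should be close to 1.
step = 0.001
area = sum(normal_pdf(mu + k * step, mu, sigma) * step for k in range(-8000, 8001))
print(f"area under the curve: {area:.6f}")

# Share of cases within k standard deviations of the mean, via the error function.
shares = {k: math.erf(k / math.sqrt(2.0)) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.4f}")
```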

The mathematical properties of the normal distribution have also made it quite popular in developing statistical methods for comparing means among groups and for studying linear relationships among variables (e.g. correlation, regression).

Testing if data fit a member of the normal family of distributions

There are many approaches to examining the goodness of fit of a normal distribution to one’s data, some graphical and some numerical. The box-and-whisker plot (see Figure 2) displays the 25th, 50th and 75th percentiles of the observed distribution in a box, with ‘whiskers’ going out towards the highest and lowest values observed. It allows one to examine symmetry (lack of skewness). The histogram displays the frequency distribution of the observed values in bars, and a simple graphical overlay of a normal curve (one that shares the same mean and standard deviation as observed) over the histogram (see Figure 3) provides a visual fit. The Q–Q plot compares the quantiles of the observed distribution against the quantiles of a normal distribution (see Figure 4). All these graphical approaches allow for ‘visual inspection’ of the fit of the observed distribution to normality.

*Email:[email protected]

© 2014 Taylor & Francis

International Journal of Injury Control and Safety Promotion, 2014, Vol. 21, No. 2, 199–201, http://dx.doi.org/10.1080/17457300.2014.924327
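The quantile comparison behind a Q–Q plot can be sketched in a few lines. This is an illustration under assumptions, not the procedure used for Figure 4: the plotting positions (i − 0.5)/n are one common convention among several, the helper names `qq_points` and `pearson_r` are hypothetical, and the scores are made up.

```python
from statistics import NormalDist, fmean, pstdev


def qq_points(values):
    """Pair each ordered observation with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def pearson_r(pairs):
    """Correlation of the Q-Q points: values near 1 mean a nearly straight line."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    return cov / (len(pairs) * pstdev(xs) * pstdev(ys))


# Made-up scores for illustration only (not registry data).
scores = [4, 9, 9, 10, 13, 16, 17, 17, 20, 22, 25, 25, 26, 29, 34, 41, 75]
points = qq_points(scores)
print(f"straightness of the Q-Q line: r = {pearson_r(points):.3f}")
```

Plotting the pairs returned by `qq_points` gives the Q–Q plot itself; points that bend away from the line at either end indicate tails heavier or lighter than normal.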

If a numerical method along with a formal statistical test is desired, many tests are available, but the most common are the D’Agostino–Pearson omnibus test of skewness and kurtosis, the Shapiro–Wilk test and the Kolmogorov–Smirnov test (D’Agostino, 1986). The first focuses on departures from either symmetry or a peakedness of 3, the second on the Q–Q plot not following a straight line, and the third on the maximum departure between the observed and the theoretical cumulative distribution functions.
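The quantities that the first and third tests examine can be computed directly, as in the rough sketch below. It stops short of the tests themselves: it returns descriptive statistics only, with no p-values, and the helper names are hypothetical. Fitting the normal curve with the sample's own mean and standard deviation is an assumption of this sketch; a formal Kolmogorov–Smirnov test with estimated parameters needs adjusted critical values. Full implementations exist in statistical libraries, e.g. `scipy.stats.normaltest`, `scipy.stats.shapiro` and `scipy.stats.kstest` in SciPy.

```python
from statistics import NormalDist, fmean, pstdev


def skewness_and_kurtosis(values):
    """Moment-based skewness and kurtosis (0 and 3 for a normal distribution)."""
    mean, sd, n = fmean(values), pstdev(values), len(values)
    z = [(v - mean) / sd for v in values]
    return sum(t ** 3 for t in z) / n, sum(t ** 4 for t in z) / n


def ks_distance(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), pstdev(ordered))
    gaps = []
    for i, v in enumerate(ordered, start=1):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps from (i - 1) / n to i / n at each observation.
        gaps.append(max(i / n - cdf, cdf - (i - 1) / n))
    return max(gaps)


sample = [1, 2, 3, 4, 5]  # symmetric toy data, for illustration only
skew, kurt = skewness_and_kurtosis(sample)
print(f"skewness = {skew:.3f}, kurtosis = {kurt:.3f}")
print(f"maximum CDF gap = {ks_distance(sample):.3f}")
```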

Illustrative example: injury severity scores

The ISS was developed as a ‘valid’ numerical measure for evaluating emergency care and for describing the overall severity of injury in persons with multiple injuries in more than one area of the body (Baker et al., 1974). It ranges from 1 to 75, but not all numbers in the range are possible given the coding scheme. Despite this limitation, it is often analysed as if it were truly continuous (Di Bartolomeo, Tillati, Valent, Zanier, & Barbone, 2010; Stevenson, Segui-Gomez, Lescohier, DiScala, & McDonald-Smith, 2001). We used the ISS values of all deaths submitted to the Texas EMS/Trauma Registry occurring from 1 January 2004 to 31 December 2004.
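To see why some values in the range cannot occur, the sketch below enumerates the attainable scores. It assumes the standard ISS construction, which the note does not spell out: the score is the sum of the squared Abbreviated Injury Scale (AIS) severity codes of the three most severely injured body regions, with codes from 0 (no injury) to 5. By convention, an unsurvivable injury coded 6 is assigned the maximum of 75, which adds no new values.

```python
from itertools import combinations_with_replacement

# Assumption: AIS severity codes 0-5 per body region (0 = no injury in that region).
AIS_CODES = range(0, 6)

# Sum of squares of the three highest regional codes; order does not matter.
possible_iss = {
    a * a + b * b + c * c
    for a, b, c in combinations_with_replacement(AIS_CODES, 3)
} - {0}  # a score of 0 would mean no injury at all

impossible = sorted(set(range(1, 76)) - possible_iss)

print(f"{len(possible_iss)} attainable values "
      f"between {min(possible_iss)} and {max(possible_iss)}")
print("unattainable values:", impossible)
```

Values such as 7 and 15 never appear because they cannot be written as a sum of three squares, which is one source of the discreteness visible in a histogram of ISS.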

The box plot (Figure 2) gives the visual impression of symmetry, while the histogram (Figure 3(a)) shows the discreteness of the ISS measure and also some asymmetry, mainly due to the large frequency of individuals assigned the maximum ISS value of 75. When the ISS values are transformed by taking the square root, the resulting histogram is more symmetric and the normal distribution provides a ‘better’ visual fit (Figure 3(b)). The Q–Q plot for the square root of ISS (Figure 4) closely follows the straight line except in the upper and lower tails, due in part to the relatively large numbers of individuals with ISS values of 1 (116 individuals) and 75 (183 individuals), which cause the distribution to have ‘heavy tails’.
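The effect of a square-root transformation on asymmetry can be illustrated on synthetic data. The sketch below does not use or imitate the registry data: the gamma-distributed scores, the seed and the helper name `skewness` are assumptions chosen only to produce a right-skewed variable.

```python
import math
import random
from statistics import fmean, pstdev


def skewness(values):
    """Third standardised moment; 0 indicates a symmetric distribution."""
    mean, sd = fmean(values), pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(75)  # fixed seed for reproducibility

# Synthetic right-skewed scores (gamma-distributed); NOT registry data.
scores = [rng.gammavariate(2.0, 8.0) for _ in range(5_000)]
sqrt_scores = [math.sqrt(s) for s in scores]

print(f"skewness before transform:  {skewness(scores):.2f}")
print(f"skewness after square root: {skewness(sqrt_scores):.2f}")
```

A skewness closer to zero after the transformation corresponds to the more symmetric histogram described above.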

When the formal tests are conducted, all three reject the null hypothesis of a normal distribution for ISS. For the square root of ISS, which visually seemed to follow a normal distribution (Figure 3(b)), all but the D’Agostino–Pearson test for skewness and kurtosis rejected the null hypothesis of a normal distribution. These tests are quite sensitive to any slight departure from normality, and thus their use is not recommended in large samples such as ours (n = 2765).
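One way to see this sensitivity is through the Kolmogorov–Smirnov test, whose rejection threshold shrinks as the sample grows. The sketch below uses the familiar large-sample approximation 1.36/√n for the 5% critical value. That approximation assumes a fully specified reference distribution, so it is only indicative when the mean and standard deviation are estimated from the data; the function name is hypothetical.

```python
import math

KS_COEFFICIENT_5PCT = 1.36  # large-sample coefficient for a 5% two-sided KS test


def ks_critical_distance(n):
    """Approximate largest CDF gap tolerated at the 5% level for sample size n."""
    return KS_COEFFICIENT_5PCT / math.sqrt(n)


for n in (50, 500, 2765):
    print(f"n = {n:>4}: reject if the maximum gap exceeds {ks_critical_distance(n):.3f}")
```

With thousands of observations, a gap of only a few percentage points between the observed and theoretical cumulative distributions is enough to reject normality, even when the departure is of little practical importance.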

Concluding remarks

When analysing continuous data, or indexes that have a broad range and could be treated as if they were continuous, one must take care to ensure that they fit the probability distribution one wishes them to follow before using statistical methods based on that probability distribution. If one’s data do not fit the normal distribution, methods that do not require normality can be used (e.g. non-parametric or robust methods), or the data can be transformed (e.g. by taking logarithms or square roots). Visual inspections to assess whether the assumption of normality fits the data are adequate methods, even if not quantifiable. They can be complemented with formal testing procedures, applied with caution in large samples.

Figure 1. Shape of the bell-shaped ‘normal’ distribution, with per cent of cases in eight subsections of the probability distribution.

Figure 2. Box plot of injury severity scores (ISS) for all deaths submitted to the Texas EMS/Trauma Registry occurring from 1 January 2004 to 31 December 2004.

References

Baker, S.P., O’Neill, B., Haddon, W., & Long, W.B. (1974). The injury severity score: A method for describing patients with multiple injuries and evaluating emergency care. The Journal of Trauma, 14(3), 187–196.

D’Agostino, R.B. (1986). Tests for normal distribution. In R.B. D’Agostino & M.A. Stephens (Eds.), Goodness-of-fit techniques (pp. 367–420). New York, NY: Marcel Dekker.

Di Bartolomeo, S., Tillati, S., Valent, F., Zanier, L., & Barbone, F. (2010). ISS mapped from ICD-9-CM by a novel freeware versus traditional coding: A comparative study. Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine, 18, 17. doi:10.1186/1757-7241-18-17

Stahl, S.L. (2006). The evolution of the normal distribution. Mathematics Magazine, 79(2), 96–113.

Stevenson, M., Segui-Gomez, M., Lescohier, I., DiScala, C., & McDonald-Smith, G. (2001). An overview of the injury severity score and the new injury severity score. Injury Prevention, 7, 10–13.

Texas EMS Trauma Registry. (2004). Data. Retrieved from https://www.dshs.state.tx.us/emstraumasystems/DisproYearlyRequirementsOutlineGeneric.pdf

Figure 3. Histogram and overlaid fitted normal distribution curve of (a) injury severity scores (ISS) and (b) the square root of ISS (sqrtISS) for all deaths submitted to the Texas EMS/Trauma Registry occurring from 1 January 2004 to 31 December 2004.

Figure 4. Quantile–quantile (Q–Q) plot of the square root of injury severity scores (sqrtISS) for all deaths submitted to the Texas EMS/Trauma Registry occurring from 1 January 2004 to 31 December 2004.
