ACCIDENTAL NOTE
How do I know if I am normal?
Shrikant I. Bangdiwala a,b*
a Department of Biostatistics, UNC Gillings School of Global Public Health, Chapel Hill, NC, USA; b Institute for Social and Health Sciences, University of South Africa, Johannesburg, South Africa
Introduction
In the injury field, most of the variables of interest are
count data – e.g. number of drownings in a given locale
in a given time period; number of pedestrian deaths in a
given area in a given time period. However, there are
some important variables that are measured on a continuous
scale – e.g. blood alcohol concentration (BAC) in g/dl,
speeds of vehicles in km/hr – or constructed indexes or
scales that are treated as continuous variables – e.g. injury
severity score (ISS) (Baker, O’Neill, Haddon, & Long, 1974),
quality of life scales. When analysing a variable measured
on a continuous scale, one is interested in describing its
distribution: what is its range of values; are some values
more common than others; is the shape of the distribution
skewed or symmetric, peaked or flat? If the observed
distribution fits a particular family of well-characterised
theoretical probability distributions, one could use the
mathematical properties of that family of distributions to
simplify description and further analyses.
A commonly used family of probability distributions is
the Gaussian bell-shaped distribution, commonly referred
to as the normal distribution, since it was found to
‘normally’ describe the distribution of errors in
measurements. It is quite popular because it has useful
mathematical properties, because many other probability
distributions are approximately bell-shaped, and largely
because, regardless of a variable’s original distribution,
the distribution of means of samples taken from the original
distribution follows the Gaussian bell shape in large samples.
Given its usefulness and widespread applicability, how
should one verify whether the normal distribution fits one’s
data? This Accidental Note addresses this question using
ISS values for deaths in the Texas EMS Trauma Registry (2004).
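The large-sample behaviour of sample means can be illustrated with a small simulation. The sketch below is not part of the analysis in this note: the exponential distribution standing in for a skewed ‘original’ variable, the sample size of 50 and the helper name `skewness` are all illustrative assumptions.

```python
import random
import statistics


def skewness(values):
    """Moment-based sample skewness: third central moment over sd cubed."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(2014)  # fixed seed so the sketch is repeatable

# Skewed 'original' variable: exponential with mean 1 (theoretical skewness 2).
raw = [rng.expovariate(1.0) for _ in range(20000)]

# Means of 2000 samples, each of size 50, drawn from the same distribution.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(50))
    for _ in range(2000)
]

print(f"skewness of raw values:   {skewness(raw):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The raw values stay strongly skewed, while the skewness of the sample means should be much closer to the zero expected of a bell-shaped distribution.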
The family of normal distributions
In 1733, the French mathematician Abraham de Moivre
found a good approximation in large samples to the binomial
probability distribution that was extensively used at
the time in games of chance. In 1809, the German
mathematician Carl Friedrich Gauss developed a two-parameter
function based on the mathematical exponential
function while studying astronomical observation errors
(Stahl, 2006). Its widespread use for describing error
distributions led the famous statistician Karl Pearson to call
it the normal distribution in 1893. Gauss’ function f(x) is
symmetric (so its mean equals its median), and it is fully
characterised by two parameters – the mean μ and
the standard deviation σ:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

Since it is symmetric, its asymmetry (skewness) is zero,
and it is characterised by having a peakedness (kurtosis)
of 3. It is often referred to as the bell-shaped distribution
as shown in Figure 1.
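As a rough check on the density and on percentages of the kind annotated in Figure 1, the following sketch evaluates Gauss’ function directly and compares it with Python’s built-in `statistics.NormalDist`. The parameter values and the function name `normal_pdf` are illustrative assumptions, not taken from the note.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


mu, sigma = 0.0, 1.0  # illustrative parameter values
dist = NormalDist(mu, sigma)

# The hand-written formula agrees with the library density.
assert math.isclose(normal_pdf(1.3, mu, sigma), dist.pdf(1.3))

# Share of cases within 1, 2 and 3 standard deviations of the mean.
within = {
    k: dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    for k in (1, 2, 3)
}
for k, share in within.items():
    print(f"within {k} sd: {100 * share:.1f}%")
```

Because the shares depend only on the distance from the mean in standard deviation units, changing `mu` or `sigma` leaves them unchanged.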
The mathematical properties of the normal distribu-
tion have also made it quite popular in developing statisti-
cal methods for comparing means among groups and for
studying linear relationships among variables (e.g. corre-
lation, regression).
Testing if data fit a member of the normal family of distributions
There are many approaches to examine the goodness of fit
of a normal distribution to one’s data, some graphical
and some numerical. The box-and-whisker plot (see
Figure 2) displays the 25th, 50th and 75th percentiles of
the observed distribution in a box, with ‘whiskers’ going
out towards the highest and lowest values observed. It
allows one to examine symmetry (lack of skewness). The
histogram displays the frequency distribution of values
observed in bars, and a simple graphical overlay of a nor-
mal curve (that shares the same mean and standard devia-
tion as observed) over the histogram (see Figure 3)
provides a visual fit. The Q–Q plot compares the quantiles
of the observed distribution against the quantiles of a
normal distribution (see Figure 4). All these graphical
approaches allow for ‘visual inspection’ of the fit of the
observed distribution to normality.
*Email: [email protected]
© 2014 Taylor & Francis
International Journal of Injury Control and Safety Promotion, 2014
Vol. 21, No. 2, 199–201, http://dx.doi.org/10.1080/17457300.2014.924327
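The quantities behind the box plot and the Q–Q plot can be computed without any plotting library. The sketch below is only an illustration under stated assumptions: the data are synthetic draws from a normal distribution (not registry values), the plotting positions (i − 0.5)/n are one common convention among several, and the function name `qq_points` is hypothetical.

```python
import random
from statistics import NormalDist, fmean, pstdev, quantiles


def qq_points(values):
    """Pair each sorted observation with the matching quantile of a normal
    distribution that has the same mean and standard deviation."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), pstdev(ordered))
    # Plotting positions (i - 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


rng = random.Random(1)
sample = [rng.gauss(30, 8) for _ in range(500)]  # synthetic, not registry data

# The three percentiles that form the box of the box-and-whisker plot.
q1, median, q3 = quantiles(sample, n=4)
print(f"box: {q1:.1f}, {median:.1f}, {q3:.1f}; whiskers towards "
      f"{min(sample):.1f} and {max(sample):.1f}")

# Points close to the line y = x indicate a good fit to normality.
points = qq_points(sample)
worst = max(abs(observed - theo) for theo, observed in points)
print(f"largest gap from the Q-Q reference line: {worst:.2f}")
```

Plotting the pairs in `points` against the line y = x gives the Q–Q plot; systematic curvature or stray points in the tails are what the visual inspection looks for.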
If a numerical method along with a formal statistical
test is desired, there are many tests available, but the most
common are the D’Agostino–Pearson omnibus
test of skewness and kurtosis, the Shapiro–Wilk test and
the Kolmogorov–Smirnov test (D’Agostino, 1986). The
first focuses on departures from either symmetry or a
peakedness of 3, the second on the Q–Q plot not following
a straight line, and the third on the maximum
departure between the observed and the theoretical
cumulative distribution functions.
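To make concrete what these tests look at, the sketch below computes the raw ingredients only: sample skewness, sample kurtosis and the maximum distance between the empirical and fitted normal cumulative distribution functions. It does not compute the tests’ p-values; in Python, SciPy’s `scipy.stats` module offers `normaltest`, `shapiro` and `kstest` for that. The data are synthetic and the function names are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean, pstdev


def moment_shape(values):
    """Return (skewness, kurtosis) from central moments; a normal
    distribution has skewness 0 and kurtosis 3."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def ks_distance(values):
    """Largest gap between the empirical CDF and the CDF of a normal
    distribution fitted with the sample mean and standard deviation."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), pstdev(ordered))
    gap = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        gap = max(gap, i / n - cdf, cdf - (i - 1) / n)
    return gap


rng = random.Random(7)
bell = [rng.gauss(0, 1) for _ in range(2000)]         # synthetic normal data
skewed = [rng.expovariate(1.0) for _ in range(2000)]  # synthetic skewed data

for name, data in (("normal", bell), ("exponential", skewed)):
    skew, kurt = moment_shape(data)
    print(f"{name}: skewness={skew:.2f}, kurtosis={kurt:.2f}, "
          f"KS distance={ks_distance(data):.3f}")
```

For the synthetic normal sample the skewness should sit near 0, the kurtosis near 3 and the distance near 0; the skewed sample departs on all three.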
Illustrative example – injury severity scores
The ISS was developed as a ‘valid’ numerical measure for
evaluating emergency care and for describing the overall
severity of injury in persons with multiple injuries in
more than one area of the body (Baker et al., 1974). It
ranges from 1 to 75, but not all numbers in the range are
possible given the coding scheme. Despite this limitation,
it is often analysed as if it were truly continuous
(Di Bartolomeo, Tillati, Valent, Zanier, & Barbone, 2010;
Stevenson, Segui-Gomez, Lescohier, DiScala, &
McDonald-Smith, 2001). We used the ISS values of all deaths
submitted to the Texas EMS/Trauma Registry occurring
from 1 January 2004 to 31 December 2004.
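Why some values between 1 and 75 cannot occur is easiest to see by enumeration. The sketch below assumes the commonly used scoring rule that originates with Baker et al. (1974) and is not restated in this note: the ISS is the sum of the squares of the highest Abbreviated Injury Scale (AIS) grades, each from 1 to 5, in the three most severely injured body regions, and an AIS grade of 6 in any region sets the ISS to 75. The function name is illustrative.

```python
from itertools import combinations_with_replacement


def possible_iss_values():
    """Enumerate every attainable ISS under the assumed scoring rule."""
    values = set()
    # AIS grade 0 stands for 'no injury in that body region'.
    for grades in combinations_with_replacement(range(0, 6), 3):
        score = sum(g * g for g in grades)
        if score > 0:  # at least one injury is needed for a score
            values.add(score)
    values.add(75)  # assumed rule: any AIS grade of 6 gives ISS = 75
    return sorted(values)


attainable = possible_iss_values()
missing = [v for v in range(1, 76) if v not in attainable]
print(f"{len(attainable)} attainable values between "
      f"{attainable[0]} and {attainable[-1]}")
print("first few impossible values:", missing[:5])
```

Under this rule only 44 of the 75 integers can occur, which is the discreteness that becomes visible in a histogram of ISS.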
The box plot (Figure 2) gives the visual impression of
symmetry, while the histogram (Figure 3(a)) shows the
discreteness of the ISS measure and also some asymmetry,
mainly due to the large frequency of individuals
assigned the maximum ISS value of 75. When the ISS
values are transformed by the square root, the resulting
histogram is more symmetric and the normal distribution
provides a ‘better’ visual fit (Figure 3(b)). The Q–Q plot
for the square root of ISS (Figure 4) closely follows the
straight line except in the upper and lower tails, due in
part to the relatively large numbers of individuals with
ISS values of 1 (116 individuals) and 75 (183 individuals),
which give the distribution ‘heavy tails’.
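The general effect of a square-root transformation on asymmetry can be sketched with synthetic data. The values below are draws from a right-skewed gamma distribution chosen purely for illustration; they are not the registry’s ISS values, and the helper name `skewness` is an assumption.

```python
import math
import random
from statistics import fmean


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(2004)

# Synthetic right-skewed scores; these are NOT the registry's ISS values.
scores = [rng.gammavariate(2.0, 10.0) for _ in range(5000)]
root_scores = [math.sqrt(s) for s in scores]

print(f"skewness before transformation: {skewness(scores):.2f}")
print(f"skewness after square root:     {skewness(root_scores):.2f}")
```

The square root compresses large values more than small ones, which pulls in a long right tail; how well this works for a given variable still has to be checked on the actual data.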
When conducting the formal tests, all three reject
the null hypothesis of a normal distribution for ISS. For the
square root of ISS, which visually seemed to follow a normal
distribution (Figure 3(b)), all but the D’Agostino–Pearson
test for skewness and kurtosis rejected the null
hypothesis of a normal distribution. These tests are quite
sensitive to any slight departure from normality, and thus
in large samples such as ours (n = 2765), their use is
not recommended.
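The sensitivity to sample size can be seen in a back-of-the-envelope calculation. The sketch below is not one of the three tests used in this note: it relies on the large-sample approximation that, under normality, sample skewness has a standard error of about sqrt(6/n), and the skewness value of 0.10 is a hypothetical example of a visually negligible asymmetry.

```python
import math
from statistics import NormalDist


def skewness_p_value(skew, n):
    """Two-sided p-value for 'skewness = 0', using the large-sample
    approximation that sample skewness has standard error sqrt(6 / n)."""
    z = skew / math.sqrt(6.0 / n)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


mild_skew = 0.10  # hypothetical, visually negligible asymmetry
for n in (100, 2765):
    print(f"n={n}: p={skewness_p_value(mild_skew, n):.3f}")
```

With n = 100 the same skewness is far from significant at the 5% level, whereas with n = 2765 it falls below that threshold, even though the shape of the distribution is identical in both cases.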
Concluding remarks
When analysing continuous data or indexes that have a
broad range and could be treated as if they are continuous,
one must take care to ensure that they do fit the probability
distribution that one wishes they follow, prior to using
statistical methods based on that probability distribution.
If one’s data do not fit the normal distribution, methods
that do not require normality can be used (e.g. non-parametric
or robust methods), or the data can be transformed
(e.g. taking logarithms or square roots). Visual
inspections to assess whether the assumption of normality
fits the data are adequate methods, even if not
quantifiable. They can be complemented with formal testing
procedures, applied with caution in large samples.
Figure 1. Shape of the bell-shaped ‘normal’ distribution, with per cent of cases in eight subsections of the probability distribution.
Figure 2. Box plot of injury severity scores (ISS) for all deaths submitted to the Texas EMS/Trauma Registry occurring from 1 January 2004 to 31 December 2004.
References
Baker, S.P., O’Neill, B., Haddon, W., & Long, W.B. (1974). The injury severity score: A method for describing patients with multiple injuries and evaluating emergency care. The Journal of Trauma, 14(3), 187–196.
D’Agostino, R.B. (1986). Tests for normal distribution. In R.B. D’Agostino & M.A. Stephens (Eds.), Goodness-of-fit techniques (pp. 367–420). New York, NY: Marcel Dekker.
Di Bartolomeo, S., Tillati, S., Valent, F., Zanier, L., & Barbone, F. (2010). ISS mapped from ICD-9-CM by a novel freeware versus traditional coding: A comparative study. Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine, 18, 17. doi:10.1186/1757-7241-18-17
Stahl, S.L. (2006). The evolution of the normal distribution. Mathematics Magazine, 79(2), 96–113.
Stevenson, M., Segui-Gomez, M., Lescohier, I., DiScala, C., & McDonald-Smith, G. (2001). An overview of the injury severity score and the new injury severity score. Injury Prevention, 7, 10–13.
Texas EMS Trauma Registry. (2004). Data. Retrieved from https://www.dshs.state.tx.us/emstraumasystems/DisproYearlyRequirementsOutlineGeneric.pdf
Figure 3. Histogram and overlaid fitted normal distribution curve of (a) injury severity scores (ISS) and (b) the square root of ISS (sqrtISS) for all deaths submitted to the Texas EMS/Trauma Registry occurring from 1 January 2004 to 31 December 2004.
Figure 4. Quantile–quantile (Q–Q) plot of the square root of injury severity scores (sqrtISS) for all deaths submitted to the Texas EMS/Trauma Registry occurring from 1 January 2004 to 31 December 2004.