Use of Statistics by Scientist


In this article we look at the initial steps in data analysis (i.e., exploratory data analysis) and how to calculate the basic summary statistics (the mean and sample standard deviation). These two processes, which increase our understanding of the data structure, are vital if the correct selection of more advanced statistical methods and interpretation of their results are to be achieved. From that base we will progress to significance testing (t-tests and the F-test). These statistics allow a comparison between two sets of results in an objective and unbiased way. For example, significance tests are useful when comparing a new analytical method with an old method or when comparing the current day's production with that of the previous day.

    Exploratory Data Analysis

Exploratory data analysis is a term used to describe a group of techniques (largely graphical in nature) that sheds light on the structure of the data. Without this knowledge the scientist, or anyone else, cannot be sure they are using the correct form of statistical evaluation. The statistics and graphs referred to in this first section are applicable to a single column of data (i.e., univariate data), such as the number of analyses performed in a laboratory each month. For small amounts of data (


obtained). If any systematic trends are observed (Figures 3(a)–3(c)) then the reasons for this must be investigated. Normal statistical methods assume a random distribution about the mean with time (Figure 3(d)), but if this is not the case the interpretation of the statistics can be erroneous.
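A minimal sketch of the kind of exploratory plots discussed in this section (a run-order plot against time, a frequency histogram and a box-and-whisker plot); the monthly counts are hypothetical.

```python
# Exploratory plots for a single column of (univariate) data
import matplotlib.pyplot as plt

analyses_per_month = [112, 108, 121, 117, 98, 125, 119, 104, 131, 115, 122, 109]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 3))
ax1.plot(range(1, 13), analyses_per_month, marker="o")  # look for trends with time
ax1.set_xlabel("Month")
ax1.set_ylabel("Number of analyses")
ax2.hist(analyses_per_month, bins=5)                    # frequency histogram
ax3.boxplot(analyses_per_month)                         # box-and-whisker plot
plt.tight_layout()
plt.show()
```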

    Summary Statistics

Summary statistics are used to make sense of large amounts of data. Typically, the mean, sample standard deviation, range, confidence intervals, quantiles (1), and measures for skewness and spread/peakedness of the distribution (kurtosis) are reported (2). The mean and sample standard deviation are the most widely used and are discussed below, together with how they relate to the confidence intervals for normally distributed data.

    The Mean

The average or arithmetic mean (3) is generally the first statistic everyone is taught to calculate. This statistic is easily found using a calculator or spreadsheet and simply involves the summing of the individual results (x1, x2, x3, ..., xn) and division by the number of results (n), where:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
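A minimal sketch of this calculation in Python; the five results are hypothetical.

```python
# Arithmetic mean: sum of the individual results divided by the number of results
results = [4.2, 4.5, 6.8, 7.2, 4.3]

mean = sum(results) / len(results)
print(f"n = {len(results)}, mean = {mean:.2f}")   # n = 5, mean = 5.40
```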

Unfortunately, the mean is often reported as an estimate of the true value (μ) of whatever is being measured without considering the underlying distribution. This is a mistake. Before any statistic is calculated it is important that the raw data should be carefully scrutinized and plotted as described above. An outlying point can have a big effect on the mean (compare Figure 1(a) with 1(b)).

    The Standard Deviation (3)

The standard deviation is a measure of the spread of data (dispersion) about the mean and can again be calculated using a calculator or spreadsheet. There is, however, a slight added complication; if you look at a typical scientific calculator you will notice there are two types of
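A minimal sketch, assuming the two types referred to are the sample standard deviation (which divides by n − 1) and the population standard deviation (which divides by n), using Python's statistics module; the data are hypothetical.

```python
# Sample versus population standard deviation for the same data set
import statistics

results = [4.2, 4.5, 6.8, 7.2, 4.3]

print(f"sample s         = {statistics.stdev(results):.3f}")   # divides by n - 1
print(f"population sigma = {statistics.pstdev(results):.3f}")  # divides by n
```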

    Box 1: Stem-and-leaf plot

A stem-and-leaf plot is another method of examining patterns in a data set. It shows the range, where the values are concentrated, and the symmetry. This type of plot is constructed by splitting the data into the stem (the leading digits), which in the figure below runs from 0.1 to 0.6, and the leaf (the trailing digit). Thus, 0.216 is represented as 2|1 and 0.350 by 3|5. Note that the decimal places are truncated and not rounded in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on the lines closest to that end of the range); thus, there are 13 points which lie between 0.40 and 0.63. The line containing the middle value is indicated differently, with a count (the number of items in the line) enclosed in parentheses.

Stem-and-leaf plot (units = 0.1, so 1|2 = 0.12; count = 42)

  5   1 | 22677
 14   2 | 112224578
(15)  3 | 000011122333355
 13   4 | 0047889
  6   5 | 56669
  1   6 | 3
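A minimal sketch of how such a plot can be produced in code, truncating (not rounding) each value as described in Box 1; the data values are hypothetical.

```python
# Build a simple stem-and-leaf plot (stems = first digit after the decimal point)
from collections import defaultdict

data = [0.12, 0.12, 0.16, 0.27, 0.21, 0.35, 0.30, 0.41, 0.47, 0.63]

stems = defaultdict(list)
for value in data:
    stem, leaf = divmod(int(value * 100), 10)   # truncate to two digits
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"{stem} | {leaves}")
```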

figure 2: Frequency histogram and box-and-whisker plot. ((a) Frequency histogram, the vertical axis showing the frequency (number of data points in each bar). (b) Box-and-whisker plot marking the lower quartile, median and upper quartile values, with whiskers extending 1.5 × the interquartile range* beyond the box and an outlier marked with an asterisk. *The interquartile range is the range which contains the middle 50% of the data when it is sorted into ascending order.)

figure 3: Time-indexed plots of magnitude against time. ((a) n = 7, mean = 6, standard deviation = 2.16; (b) n = 9, mean = 6, standard deviation = 2.65; (c) n = 9, mean = 6, standard deviation = 2.06; (d) n = 9, mean = 6, standard deviation = 1.80.)


    Significance Testing

Suppose, for example, we have the following two sets of results for lead content in water: 17.3, 17.3, 17.4, 17.4 and 18.5, 18.6, 18.5, 18.6. It is fairly clear, by simply looking at the data, that the two sets are different. In reaching this conclusion you have probably considered the amount of data, the average for each set and the spread in the results. The difference between two sets of data is, however, not so clear in many situations. The application of significance tests gives us a more systematic way of assessing the results, with the added advantage of allowing us to express our conclusion with a stated degree of confidence.

    What does significance mean?

In statistics the words significant and significance have specific meanings. A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences unlikely to occur because of purely random variation.

As previously mentioned, whether one set of results is significantly different from another depends not only on the magnitude of the difference in the means but also on the amount of data available and its spread. For example, consider the blob plots shown in Figure 5. For the two data sets shown in Figure 5(a), the means for set (i) and set (ii) are numerically different. From the limited amount of information available, however, they are, from a statistical point of view, the same. For Figure 5(b), the means for set (i) and set (ii) are probably different, but when fewer data points are available, Figure 5(c), we cannot be sure with any degree of confidence that the means are different even if they are a long way apart. With a large number of data points, even a very small difference can be significant (Figure 5(d)). Similarly, when we are interested in comparing the spread of results, for example, when we want to know if method (i) gives more consistent results than method (ii), we have to take note of the amount of information available (Figures 5(e)–(g)).

It is fortunate that tables are published that show how large a difference needs to be before it can be considered not to have occurred by chance. These are critical t-values for differences between means, and critical F-values for differences between the spread of results (4).

Note: significance is a function of sample size. Comparing very large samples will nearly always lead to a significant difference, but a statistically significant result is not necessarily an important result. For example, in Figure 5(d) there is a statistically significant difference, but does it really matter in practice?

What is a t-test?

A t-test is a statistical procedure that can be used to compare mean values. A lot of jargon surrounds these tests (see Table 1 for definitions of the terms used below) but they are relatively simple to apply using the built-in functions of a spreadsheet like Excel or a statistical software package. Using a calculator is also an option but you have to know the correct formula to apply (see Table 2) and have access to statistical tables to look up the so-called critical values (4). Three worked examples are shown in Box 2 (5) to illustrate how the different t-tests are carried out and how to interpret the results.

    What is an F-test?

An F-test compares the spread of results in two data sets to determine if they could reasonably be considered to come from the same parent distribution. The test can, therefore, be used to answer questions such as: are two methods equally precise? The measure of spread used in the F-test is the variance, which is simply the square of the standard deviation. The variances are ratioed (i.e., the variance of one set of data is divided by the variance of the other) to get the test value:

$$F = \frac{s_1^2}{s_2^2}$$

This F value is then compared with a critical value that tells us how big the ratio needs to be to rule out the difference in spread occurring by chance. The Fcrit value is found from tables using (n1 − 1) and (n2 − 1) degrees of freedom, at the appropriate level of confidence. [Note: it is usual to arrange s1 and s2 so that F > 1.] If the standard deviations are to be considered to come from the same population then Fcrit > F. As an example we use the data in Example 2 (see Box 2):

$$F = \frac{2.750^2}{1.471^2} = 3.49$$

Fcrit = 9.605 for (5 − 1) and (5 − 1) degrees of freedom at the 97.5% confidence level. As Fcrit > Fcalculated we can conclude that the spread of results in the two data sets is not significantly different and it is, therefore, reasonable to combine the two standard deviations as we have done.
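A minimal sketch of this F-test using the Example 2 standard deviations (s1 = 1.471, s2 = 2.750, five results each) and scipy; illustrative only.

```python
# F-test: ratio of variances compared with the tabulated critical value
from scipy import stats

s1, n1 = 1.471, 5
s2, n2 = 2.750, 5

F = max(s1, s2) ** 2 / min(s1, s2) ** 2              # arrange so that F > 1
F_crit = stats.f.ppf(0.975, dfn=n2 - 1, dfd=n1 - 1)  # 97.5% point, (4, 4) dof

print(f"F = {F:.2f}, F_crit = {F_crit:.3f}")
# F = 3.49 < F_crit = 9.60, so the spreads are not significantly different
```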

table 1: Definitions of statistical terms used in significance testing.

Alternate hypothesis (H1): A statement describing the alternative to the null hypothesis (i.e., there is a difference between the means [see two-tailed], or mean1 is greater (or less) than mean2 [see one-tailed]).

Critical value (tcrit or Fcrit): The value obtained from statistical tables or statistical packages at a given confidence level, against which the result of applying a significance test is compared.

Null hypothesis (H0): A statement describing what is being tested (i.e., there is no difference between the two means [mean1 = mean2]).

One-tailed: A one-tailed test is performed if the analyst is only interested in the answer when the result is different in one direction, for example, (1) the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value <, >,


    Box 2

    Example 1

A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 g/L. For the new method the mean is 23.5 g/L, based on 10 results with a standard deviation of 0.9 g/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:

The null hypothesis (H0): new method mean = long-term check sample mean.
The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.

To test the null hypothesis we calculate the t-value as below. Note that the calculated t-value is the ratio of the difference between the means and a measure of the spread (standard deviation) and the amount of data available (n).

$$t = \frac{23.5 - 22.7}{0.9/\sqrt{10}} = 2.81$$

In the final step of the significance test we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value we need to know three pieces of information:

(i) Are we interested in the direction of the difference between the two means or only that there is a difference; that is, are we performing a one-sided or two-sided t-test (see Table 1)? In the case above it is the latter, therefore the two-sided critical value is used.
(ii) The degrees of freedom: this is simply the number of data points minus one (n − 1).
(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research. In these cases, the 99% or even the 99.9% confidence level can be chosen.

tcrit = 2.26 at the 95% confidence level for 9 degrees of freedom.

As tcalculated > tcrit we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods.

[Note: This does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are fit for purpose. The significance test is only one piece of information to be considered.]
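A minimal sketch of Example 1 (a two-sided, one-sample t-test calculated from the quoted summary statistics) using scipy; illustrative only.

```python
# One-sample t-test from summary statistics (new method vs long-term mean)
from math import sqrt
from scipy import stats

long_term_mean = 22.7                 # g/L, old method check-sample mean
new_mean, new_sd, n = 23.5, 0.9, 10

t_calc = (new_mean - long_term_mean) / (new_sd / sqrt(n))
t_crit = stats.t.ppf(0.975, df=n - 1)                # two-sided, 95% confidence
p_value = 2 * stats.t.sf(abs(t_calc), df=n - 1)

print(f"t = {t_calc:.2f}, t_crit = {t_crit:.2f}, p = {p_value:.3f}")
# t = 2.81 > t_crit = 2.26, so the difference is significant at the 95% level
```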

    Example 2 (5)

Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3. Using the t-test for independent sample means we define the null hypothesis H0 as x̄1 = x̄2. This means there is no difference between the means of the two methods (the alternative hypothesis is H1: x̄1 ≠ x̄2). If the two methods have sample standard deviations that are not significantly different then we can combine (or pool) the standard deviations (Sc) (see "What is an F-test?"). If the standard deviations are significantly different then the t-test for unequal variances should be used (Table 2). Evaluating the test statistic:

$$S_c = \sqrt{\frac{1.471^2(5-1) + 2.750^2(5-1)}{(5+5-2)}} = 2.205$$

$$t = \frac{5.40 - 4.76}{2.205\sqrt{\frac{1}{5}+\frac{1}{5}}} = \frac{0.64}{1.395} = 0.459$$

The 95% critical value is 2.306 for n = 8 (n1 + n2 − 2) degrees of freedom. This exceeds the calculated value of 0.459, thus the null hypothesis (H0) cannot be rejected and we conclude there is no significant difference between the means of the results given by the two methods.

    Example 3 (5)

Two methods are available for determining the concentration of vitamins in foodstuffs. To compare the methods, several different sample matrices are prepared using the same technique. Each sample preparation is then divided into two aliquots and readings are obtained using the two methods, ideally commencing at the same time to lessen the possible effects of sample deterioration. The results are shown in Table 4.

The null hypothesis is H0: d̄ = 0 against the alternative H1: d̄ ≠ 0. The test is a two-tailed test as we are interested in both d̄ > 0 and d̄ < 0. The mean of the paired differences is d̄ = −0.475 and the sample standard deviation of the paired differences is sd = 0.700.

$$t = \frac{|\bar{d}|\sqrt{n}}{s_d} = \frac{0.475\sqrt{8}}{0.700} = 1.918$$

The tabulated value of tcrit (with n − 1 = 7 degrees of freedom, at the 95% confidence level) is 2.365. Since the calculated value is less than the critical value, H0 cannot be rejected and it follows that there is no significant difference between the two techniques.

table 3: Results from two methods used to determine concentrations of selenium.

            Results                        x̄      s
Method 1    4.2   4.5   6.8   7.2   4.3    5.40   1.471
Method 2    9.2   4.0   1.9   5.2   3.5    4.76   2.750
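A minimal sketch of Example 2 (independent two-sample t-test with pooled variance) using the Table 3 selenium results and scipy; illustrative only.

```python
# Independent two-sample t-test, pooling the standard deviations as in the text
from scipy import stats

method_1 = [4.2, 4.5, 6.8, 7.2, 4.3]
method_2 = [9.2, 4.0, 1.9, 5.2, 3.5]

t_calc, p_value = stats.ttest_ind(method_1, method_2, equal_var=True)
t_crit = stats.t.ppf(0.975, df=len(method_1) + len(method_2) - 2)

print(f"t = {t_calc:.3f}, t_crit = {t_crit:.3f}, p = {p_value:.3f}")
# |t| = 0.459 < t_crit = 2.306, so H0 cannot be rejected
```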

table 4: Comparison of two methods used to determine the concentration of vitamins in foodstuffs.

Matrix             1      2      3      4      5      6      7      8
A (mg/g)         2.52   3.13   4.33   2.25   2.79   3.04   2.19   2.16
B (mg/g)         3.17   5.00   4.03   2.38   3.68   2.94   2.83   2.18
Difference (d)  -0.65  -1.87   0.30  -0.13  -0.89   0.10  -0.64  -0.02
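A minimal sketch of Example 3 (a paired, two-tailed t-test) using the Table 4 vitamin results and scipy; illustrative only.

```python
# Paired t-test on the matrix-by-matrix differences between methods A and B
from scipy import stats

method_a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]
method_b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]

t_calc, p_value = stats.ttest_rel(method_a, method_b)
t_crit = stats.t.ppf(0.975, df=len(method_a) - 1)

print(f"t = {t_calc:.3f}, t_crit = {t_crit:.3f}, p = {p_value:.3f}")
# |t| = 1.918 < t_crit = 2.365, so H0 cannot be rejected
```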


    Two-way ANOVA

In a typical experiment things can be more complex than described previously. For example, in Example 2 the aim is to find out if time and/or temperature have any effect on protein yield when analysing samples of tinned ham. When analysing data from this type of experiment we use two-way ANOVA. Two-way ANOVA can test the significance of each of two experimental variables (factors or treatments) with respect to the response, such as an instrument's output. When replicate measurements are made we can also examine whether or not there are significant interactions between variables. An interaction is said to be present when the response being measured changes more than can be explained from the change in level of an individual factor. This is illustrated in Figure 2 for a process with two factors (Y and Z) when both factors are studied at two levels (low and high). In Figure 2(b), the changes in response caused by Y depend on Z, and vice versa.

In two-way ANOVA we ask the following questions:

• Is there a significant interaction between the two factors (variables)?
• Does a change in any of the factors affect the measured result?

It is important to check the answers in the right order: Figure 3 illustrates the decision process. In the case of Example 2 the questions are:

• Is there an interaction between temperature and time which affects the protein yield?
• Does time and/or temperature affect the protein yield?

Using the built-in functions of a spreadsheet (in this case Excel's data analysis tool, two-factor analysis with replication) we see that there is a significant interaction between time and temperature and a significant effect of temperature alone (both p-value < 0.05 and F > Fcrit). Following the process outlined in Figure 3, we consider the interaction question first by comparing the mean squares (MS) for the within-group variation with the interaction MS. This is reported in the results table of Example 2.

    F = 0.021911/0.004315 = 5.078

If the interaction is significant (F > Fcrit), as in this case, then the individual factors (time and temperature) should each be compared with the MS for the interaction (not the within-group MS), thus:

    Ftemp = 0.024844/0.021911 = 1.134

Example 2: Two-way ANOVA. The analysis of tinned ham was carried out at three temperatures (415, 435 and 460 °C) and three times (30, 60 and 90 minutes). Three analyses determining protein yield were made at each temperature and time. The measurements are summarized in the table below and the results of the two-way ANOVA are given in the ANOVA table.

Protein yield (three replicate determinations at each combination of temperature and time):

Temp (°C)     Time 30 min            Time 60 min            Time 90 min
415        27.13  27.2   27.13    27.29  27.13  27.23    27.03  27.13  27.07
435        27.2   26.97  27.13    27.07  27.1   27.03    27.2   27.23  27.27
460        27.03  27.1   27.13    27.1   27.07  27.03    27.03  27.07  26.9

Anova: Two-factor with replication

Source of Variation       SS        df    MS        F         P-value   F crit
Sample (=Time)            0.000867   2    0.000433  0.100429  0.904952  3.554561
Columns (=Temperature)    0.049689   2    0.024844  5.75794   0.011667  3.554561
Interaction               0.087644   4    0.021911  5.078112  0.006437  2.927749
Within                    0.077667  18    0.004315
Total                     0.215867  26
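A minimal sketch of the same two-way ANOVA using statsmodels rather than Excel's two-factor tool, with the protein yields tabulated above; illustrative only.

```python
# Two-way ANOVA with replication: main effects of temperature and time plus
# their interaction
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

yields = {
    (415, 30): [27.13, 27.2, 27.13], (415, 60): [27.29, 27.13, 27.23], (415, 90): [27.03, 27.13, 27.07],
    (435, 30): [27.2, 26.97, 27.13], (435, 60): [27.07, 27.1, 27.03],  (435, 90): [27.2, 27.23, 27.27],
    (460, 30): [27.03, 27.1, 27.13], (460, 60): [27.1, 27.07, 27.03],  (460, 90): [27.03, 27.07, 26.9],
}
rows = [{"temp": t, "time": m, "protein_yield": y}
        for (t, m), values in yields.items() for y in values]
df = pd.DataFrame(rows)

# C() treats each variable as a categorical factor; '*' adds the interaction term
model = ols("protein_yield ~ C(temp) * C(time)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```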

Example 1: An example of one-way ANOVA carried out by Excel

(Note: the data table has been split into two sections (A_1 to A_6, A_7 to A_12) for display purposes. The ANOVA is carried out on a single table.)

SS = sum of squares, df = degrees of freedom, MS = mean square (SS/df).

The P-value is < 0.05 (the F value is greater than Fcrit at the 95% confidence level for 11 and 36 degrees of freedom), therefore it can be concluded that there is a significant difference between the analysts' results.

Replicate results for analysts A_1 to A_6:

              A_1     A_2     A_3     A_4     A_5     A_6
Replicate 1   34.1    35.84   36.67   40.54   41.19   41.22
Replicate 2   34.1    36.58   37.33   40.67   40.29   39.61
Replicate 3   34.69   31.3    36.96   40.81   40.99   37.89
Replicate 4   34.6    34.19   36.83   40.78   40.4    36.67

Replicate results for analysts A_7 to A_12:

              A_7     A_8     A_9     A_10    A_11    A_12
Replicate 1   40.71   39.2    42.5    39.75   36.04   44.36
Replicate 2   40.91   39.3    42.3    39.69   37.03   45.73
Replicate 3   40.8    39.3    42.5    39.23   36.85   45.25
Replicate 4   38.42   39.3    42.5    39.73   36.24   45.34

Anova: Single Factor

Source of Variation    SS        df    MS        F         P-value   F crit
Between Groups         438.7988  11    39.8908   40.31545  6.6E-17   2.066606
Within Groups          35.6208   36    0.989467
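A minimal sketch of the same one-way ANOVA using scipy with the replicate results tabulated above; illustrative only.

```python
# One-way ANOVA: do the 12 analysts' mean results differ significantly?
from scipy import stats

analysts = {
    "A_1":  [34.1, 34.1, 34.69, 34.6],    "A_2":  [35.84, 36.58, 31.3, 34.19],
    "A_3":  [36.67, 37.33, 36.96, 36.83], "A_4":  [40.54, 40.67, 40.81, 40.78],
    "A_5":  [41.19, 40.29, 40.99, 40.4],  "A_6":  [41.22, 39.61, 37.89, 36.67],
    "A_7":  [40.71, 40.91, 40.8, 38.42],  "A_8":  [39.2, 39.3, 39.3, 39.3],
    "A_9":  [42.5, 42.3, 42.5, 42.5],     "A_10": [39.75, 39.69, 39.23, 39.73],
    "A_11": [36.04, 37.03, 36.85, 36.24], "A_12": [44.36, 45.73, 45.25, 45.34],
}

f_value, p_value = stats.f_oneway(*analysts.values())
print(f"F = {f_value:.2f}, p = {p_value:.2e}")
# F is far above F crit (about 2.07 for 11 and 36 degrees of freedom), so the
# analysts' results differ significantly
```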

Note: in the two-way ANOVA output above, the spreadsheet (Excel) labels the sources of variation as Sample, Columns, Interaction and Within. Here Sample = Time, Columns = Temperature, Interaction is the interaction between temperature and time, and Within is a measure of the within-group variation.


    Ftime = 0.000433/0.021911 = 0.020

    Fcrit = 6.944, for 2 and 4 degrees of freedom (at the 95% confidence level)

In other words, neither of the individual factors (time, temperature) is significant when compared with the interaction of time and temperature and, therefore, the interaction of temperature with time is worth further investigation. If one or both of the individual factors were significant compared with the interaction, then the individual factor or factors would dominate and for all practical purposes any interaction could be ignored.

If the interaction term is not significant then it can be considered to be another small error term and can thus be pooled with the within-group (error) sums of squares term. It is the pooled value (s²pooled) that is then used as the denominator in the F-test to determine if the individual factors affect the measured results significantly. To combine the sums of squares the following formula is used:

$$s^2_{pooled} = \frac{SS_{inter} + SS_{within}}{dof_{inter} + dof_{within}} \qquad (dof_{pooled} = dof_{inter} + dof_{within})$$

where dofinter and dofwithin are the degrees of freedom for the interaction term and error term, and SSinter and SSwithin are the sums of squares for the interaction term and error term, respectively.

    Selecting the ANOVA method

One-way ANOVA should be used when there is only one factor being considered and replicate data from changing the level of that factor are available. Two-way ANOVA (with or without replication) is used when there are two factors being considered. If no replicate data are collected then the interactions between the two factors cannot be calculated. Higher level ANOVAs are also available for looking at more than two factors.

    Advantages of ANOVA

Compared with using multiple t-tests, one-way and two-way ANOVA require fewer measurements to discover significant effects (i.e., the tests are said to have more power). This is one reason why ANOVA is used frequently when analysing data from statistically designed experiments.

Other ANOVA and multivariate ANOVA (MANOVA) methods exist for more complex experimental situations but a description of these is beyond the scope of this introductory article. More details can be found in reference 6.


    Interpretation of the result(s)

To reiterate the interpretation of ANOVA results: a calculated F-value that is greater than Fcrit for a stated level of confidence (typically 95%) means that the difference being tested is statistically significant at that level. As an alternative to using the F-values, the p-value can be used to indicate the degree of confidence we have that there is a significant difference between means (i.e., (1 − p) × 100 is the percentage confidence). Normally a p-value of 0.05 or less is considered to denote a significant difference.

Note: extrapolation of ANOVA results is not advisable, so in Example 2, for instance, it is impossible to say if a time of 15 or 120 minutes would lead to a measurable effect on protein yield. It is, therefore, always more economic in the long run to design the experiment in advance, in order to cover the likely ranges of the parameter(s) of interest.

Avoiding some of the pitfalls using ANOVA

In ANOVA it is assumed that the data for each variable are normally distributed. Usually in ANOVA we don't have a large amount of data so it is difficult to prove any departure from normality. It has been shown, however, that even quite large deviations do not affect the decisions made on the basis of the F-test.

A more important assumption about ANOVA is that the variance (spread) between groups is homogeneous (homoscedastic). If this is not the case (this often happens in chemistry, see Figure 1) then the F-test can suggest a statistically significant difference when none is present. The best way to avoid this pitfall is, as ever, to plot the data. There also exist a number of tests for heteroscedasticity (i.e., Bartlett's test (5) and Levene's test (2)). It may be possible to overcome this type of problem in the data structure by transforming it, such as by taking logs (7).

figure 1: Plot comparing the results from 12 analysts. (Analyte concentration (ppm) plotted against analyst ID (A1–A12), showing each analyst's replicate results and mean, together with the total standard deviation.)



If the variability within a group is correlated with its mean value then ANOVA may not be appropriate and/or it may indicate the presence of outliers in the data (Figure 4). Cochran's test (5) can be used to test for variance outliers.

    Conclusions

• ANOVA is a powerful tool for determining if there is a statistically significant difference between two or more sets of data.
• One-way ANOVA should be used when we are comparing several sets of observations.
• Two-way ANOVA is the method used when there are two separate factors that may be influencing a result.
• Except for the smallest of data sets, ANOVA is best carried out using a spreadsheet or statistical software package.
• You should always plot your data to make sure the assumptions ANOVA is based on are not violated.

    Acknowledgements

The preparation of this paper was supported under a contract with the UK Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (8).

References

(1) S. Burke, Scientific Data Management, 1(1), 32–38, September 1997.
(2) G.A. Milliken and D.E. Johnson, Analysis of Messy Data, Volume 1: Designed Experiments, Van Nostrand Reinhold Company, New York, USA (1984).
(3) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK (ISBN 0 13 030990 7).
(4) C. Chatfield, Statistics for Technology, Chapman & Hall, London, UK (ISBN 0 412 25340 2).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK (ISBN 0 85404 442 6) (1997).
(6) K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate Analysis, Academic Press Inc. (ISBN 0 12 471252 5) (1979).
(7) ISO 4259: 1992, Petroleum Products - Determination and Application of Precision Data in Relation to Methods of Test, Annex E, International Organisation for Standardisation, Geneva, Switzerland (1992).
(8) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist (Autumn 1995).

Shaun Burke currently works in the Food Technology Department of RHM Technology Ltd, High Wycombe, Buckinghamshire, UK. However, these articles were produced while he was working at LGC, Teddington, Middlesex, UK (http://www.lgc.co.uk).


figure 4: A plot of variance versus the mean. (The plot highlights the region of significantly different means by ANOVA and an unreliably high mean that may contain outliers.)

figure 3: Comparing mean squares in two-way ANOVA with replication. (Decision flowchart: start by comparing the within-group mean squares with the interaction mean squares. If there is a significant difference (F > Fcrit), compare the interaction mean squares with the individual factor mean squares. If not, pool the within-group and interaction sums of squares and compare the pooled mean squares with the individual factor mean squares.)

figure 2: Interactive factors. ((a) Y and Z are independent; (b) Y and Z are interacting. Each panel plots the response at low and high levels of Y for low and high levels of Z.)


Calibration is fundamental to achieving consistency of measurement. Often calibration involves establishing the relationship between an instrument response and one or more reference values. Linear regression is one of the most frequently used statistical methods in calibration. Once the relationship between the input value and the response value (assumed to be represented by a straight line) is established, the calibration model is used in reverse; that is, to predict a value from an instrument response. In general, regression methods are also useful for establishing relationships of all kinds, not just linear relationships. This paper concentrates on the practical applications of linear regression and the interpretation of the regression statistics. For those of you who want to know about the theory of regression there are some excellent references (1–6).

For anyone intending to apply linear least-squares regression to their own data, it is recommended that a statistics/graphics package is used. This will speed up the production of the graphs needed to confirm the validity of the regression statistics. The built-in functions of a spreadsheet can also be used if the routines have been validated for accuracy (e.g., using standard data sets (7)).

    What is regression?

In statistics, the term regression is used to describe a group of methods that summarize the degree of association between one variable (or set of variables) and another variable (or set of variables). The most common statistical method used to do this is least-squares regression, which works by finding the best curve through the data that minimizes the sums of squares of the residuals. The important term here is the best curve, not the method by which this is achieved. There are a number of least-squares regression models, for example, linear (the most common type), logarithmic, exponential and power. As already stated, this paper will concentrate on linear least-squares regression.

[You should also be aware that there are other regression methods, such as ranked regression, multiple linear regression, non-linear regression, principal-component regression, partial least-squares regression, etc., which are useful for analysing instrument or chemically derived data, but are beyond the scope of this introductory text.]

What do the linear least-squares regression statistics mean?

Correlation coefficient: Whether you use a calculator's built-in functions, a spreadsheet or a statistics package, the first statistic most chemists look at when performing this analysis is the correlation coefficient (r). The correlation coefficient ranges from −1, a perfect negative relationship, through zero (no relationship), to +1, a perfect positive relationship (Figures 1(a)–(c)). The correlation coefficient is, therefore, a measure of the degree of linear relationship between two sets of data. However, the r value is open to misinterpretation (8) (Figures 1(d) and (e) show instances in which the r values alone would give the wrong impression of the underlying relationship). Indeed, it is possible for several different data sets to yield identical regression statistics (r value, residual sum of squares, slope and intercept), but still not satisfy the linear assumption in all cases (9). It, therefore, remains essential to plot the data in order to check that linear least-squares statistics are appropriate.

As in the t-tests discussed in the first paper (10) in this series, the statistical significance of the correlation coefficient is dependent on the number of data points. To test if a particular r value indicates a statistically significant relationship we can use the Pearson's correlation coefficient test (Table 1). Thus, if we only have four points (for which the number of degrees of freedom is 2) a linear least-squares correlation coefficient of 0.94 will not be significant at the 95% confidence level. However, if there are more than 60 points an r value of just 0.26 (r² = 0.0676) would indicate a significant, but not very strong, positive linear relationship. In other words, a relationship can be statistically significant but of no practical value. Note that the test used here simply shows whether two sets are linearly related; it does not prove linearity or adequacy of fit.
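A minimal sketch of testing the significance of a correlation coefficient with scipy, which returns a p-value directly instead of requiring the Table 1 look-up; the data pairs are hypothetical.

```python
# Pearson correlation coefficient and its significance
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1]

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
# p < 0.05 indicates a significant linear relationship; equivalently, for
# n - 2 = 6 degrees of freedom Table 1 gives a 95% critical value of 0.707
```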

It is also important to note that a significant correlation between one variable and another should not be taken as an indication of causality. For example, there is a negative correlation between time (measured in months) and catalyst performance in car exhaust systems. However, time is not the cause of the deterioration; it is the build-up of sulfur and phosphorous compounds that gradually poisons the catalyst. Causality is, in fact, very difficult to prove unless the chemist can vary systematically and independently all critical parameters, while measuring the response for each change.

Regression and Calibration

One of the most frequently used statistical methods in calibration is linear regression. This third paper in our statistics refresher series concentrates on the practical applications of linear regression and the interpretation of the regression statistics.

Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.



    Slope and intercept

In linear regression the relationship between the X and Y data is assumed to be represented by a straight line, Y = a + bX (see Figure 2), where Y is the estimated response/dependent variable, b is the slope (gradient) of the regression line and a is the intercept (the Y value when X = 0). This straight-line model is only appropriate if the data approximately fit the assumption of linearity. This can be tested for by plotting the data and looking for curvature (e.g., Figure 1(d)) or by plotting the residuals against the predicted Y values or X values (see Figure 3).
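A minimal sketch of fitting the straight-line model Y = a + bX and computing the residuals to check linearity, as suggested above; the calibration data are hypothetical.

```python
# Least-squares straight-line fit and residuals
import numpy as np
from scipy import stats

x = np.array([0.0, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([0.02, 0.13, 0.24, 0.44, 0.67, 0.88, 1.13])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

print(f"a = {fit.intercept:.3f}, b = {fit.slope:.3f}, r = {fit.rvalue:.4f}")
print("residuals:", np.round(residuals, 3))  # plot these against x to look for curvature
```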

Although the relationship may be known to be non-linear (i.e., follow a different functional form, such as an exponential curve), it can sometimes be made to fit the linear assumption by transforming the data in line with the function, for example, by taking logarithms or squaring the Y and/or X data. Note that if such transformations are performed, weighted regression (discussed later) should be used to obtain an accurate model. Weighting is required because of changes in the residual/error structure of the regression model. Using non-linear regression may, however, be a better alternative to transforming the data when this option is available in the statistical packages you are using.

    Residuals and residual standard error

A residual value is calculated by taking the difference between the predicted value and the actual value (see Figure 2). When the residuals are plotted against the predicted (or actual) data values the plot becomes a powerful diagnostic tool, enabling patterns and curvature in the data to be recognized (Figure 3). It can also be used to highlight points of influence (see "Bias, leverage and outliers" overleaf).

The residual standard error (RSE, also known as the residual standard deviation, RSD) is a statistical measure of the average residual. In other words, it is an estimate of the average error (or deviation) about the regression line. The RSE is used to calculate many useful regression statistics including confidence intervals and outlier test values.

$$RSE = s(y)\sqrt{\frac{(n-1)}{(n-2)}\left(1-r^2\right)}$$

where s(y) is the standard deviation of the y values in the calibration, n is the number of data pairs and r is the least-squares regression correlation coefficient.

Confidence intervals

As with most statistics, the slope (b) and intercept (a) are estimates based on a finite sample, so there is some uncertainty in the values. (Note: strictly, the uncertainty arises from random variability between sets of data. There may be other uncertainties, such as measurement bias, but these are outside the scope of this article.) This uncertainty is quantified in most statistical routines by displaying the confidence limits and other statistics, such as the standard error and p values. Examples of these statistics are given in Table 2.

table 1: Pearson's correlation coefficient test.

Degrees of freedom    Confidence level
(n−2)                 95% (α = 0.05)    99% (α = 0.01)
 2                    0.950             0.990
 3                    0.878             0.959
 4                    0.811             0.917
 5                    0.754             0.875
 6                    0.707             0.834
 7                    0.666             0.798
 8                    0.632             0.765
 9                    0.602             0.735
10                    0.576             0.708
11                    0.553             0.684
12                    0.532             0.661
13                    0.514             0.641
14                    0.497             0.623
15                    0.482             0.606
20                    0.423             0.537
30                    0.349             0.449
40                    0.304             0.393
60                    0.250             0.325

Significant correlation when |r| ≥ table value.

(Graph: the critical values of the correlation coefficient (r) from Table 1 plotted against degrees of freedom (n−2) for the 95% and 99% confidence levels.)


The p value is the probability that a value could arise by chance if the true value was zero. By convention a p value of less than 0.05 indicates a significant non-zero statistic. Thus, examining the spreadsheet's results, we can see that there is no reason to reject the hypothesis that the intercept is zero, but there is a significant non-zero positive gradient/relationship. The confidence interval for the regression line can be plotted for all points along the x-axis and is dumbbell shaped (Figure 2). In practice, this means that the model is more certain in the middle than at the extremes, which in turn has important consequences for extrapolating relationships.

When regression is used to construct a calibration model, the calibration graph is used in reverse (i.e., we predict the X value from the instrument response [Y value]). This prediction has an associated uncertainty, expressed as a confidence interval:

$$X_{predicted} = \frac{\bar{Y} - a}{b}$$

Confidence interval for the prediction:

$$X_{predicted} \pm \frac{t \cdot RSE}{b}\sqrt{\frac{1}{m} + \frac{1}{n} + \frac{(\bar{Y} - \bar{y})^2}{b^2 (n-1) s(x)^2}}$$

where a is the intercept and b is the slope obtained from the regression equation, Ȳ is the mean value of the response (e.g., instrument readings) for m replicates (replicates are repeat measurements made at the same level), ȳ is the mean of the y data for the n points in the calibration, t is the critical value obtained from t-tables for n−2 degrees of freedom, s(x) is the standard deviation of the x data for the n points in the calibration, and RSE is the residual standard error for the calibration.
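A minimal sketch of an inverse prediction and its confidence interval using the formulas above; the calibration data and the replicate mean response are hypothetical.

```python
# Predict a concentration from a mean instrument response, with its
# confidence interval
import numpy as np
from scipy import stats

x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])        # standard concentrations
y = np.array([0.00, 0.21, 0.47, 0.66, 0.89, 1.12])   # instrument responses

n = len(x)
b, a, r, _, _ = stats.linregress(x, y)
rse = np.std(y, ddof=1) * np.sqrt((n - 1) / (n - 2) * (1 - r**2))

y_bar, m = 0.55, 3                       # mean response of m replicate readings
x_pred = (y_bar - a) / b
t_crit = stats.t.ppf(0.975, df=n - 2)
half_width = (t_crit * rse / b) * np.sqrt(
    1/m + 1/n + (y_bar - y.mean())**2 / (b**2 * (n - 1) * np.var(x, ddof=1)))

print(f"predicted x = {x_pred:.2f} +/- {half_width:.2f}")
```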

If we want, therefore, to reduce the size of the confidence interval of the prediction there are several things that can be done.

1. Make sure that the unknown determinations of interest are close to the centre of the calibration (i.e., close to the values x̄, ȳ [the centroid point]). This suggests that if we want a small confidence interval at low values of x then the standards/reference samples used in the calibration should be concentrated around this region. For example, in analytical chemistry, a typical pattern of standard concentrations might be 0.05, 0.1, 0.2, 0.4, 0.8, 1.6

figure 3: Residuals plot (residuals plotted against X for the calibration data, with one possible outlier indicated).

figure 1: Correlation coefficients and goodness of fit. (Panels (a)–(g) show scatter plots with r values of +1, −1, 0, 0, 0.99, 0.9 and 0.9, illustrating that similar r values can arise from quite different underlying relationships.)

figure 2: Calibration graph. (Y plotted against X with the fitted line Y = −0.046 + 0.1124 X and correlation coefficient r = 0.98731; the intercept, a residual, the confidence limits for the regression line and the wider confidence limits for the prediction are indicated.)


(i.e., only one or two standards are used at higher concentrations). While this will lead to a smaller confidence interval at lower concentrations, the calibration model will be prone to leverage errors (see below).

2. Increase the number of points in the calibration (n). There is, however, little improvement to be gained by going above 10 calibration points unless standard preparation and analysis is rapid and cheap.

3. Increase the number of replicate determinations for estimating the unknown (m). Once again there is a law of diminishing returns, so the number of replicates should typically be in the range 2 to 5.

4. The range of the calibration can be extended, providing the calibration is still linear.

    Bias, leverage and outliers

Points of influence, which may or may not be outliers, can have a significant effect on the regression model and, therefore, on its predictive ability. If a point is in the middle of the model (i.e., close to x̄) but outlying on the Y axis, its effect will be to move the regression line up or down. The point is then said to have influence because it introduces an offset (or bias) in the predicted values (see Figure 1(f)). If the point is towards one of the extreme ends of the plot its effect will be to tilt the regression line. The point is then said to have high leverage because it acts as a lever and changes the slope of the regression model (see Figure 1(g)). Leverage can be a major problem if one or two data points are a long way from all the other points along the X axis.

A leverage statistic (ranging between 1/n and 1) can be calculated for each value of x. There is no set value above which this leverage statistic indicates a point of influence; a value of 0.9 is, however, used by some statistical software packages.

$$\text{Leverage}_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$

where xi is the x value for which the leverage statistic is to be calculated, n is the number of points in the calibration and x̄ is the mean of all the x values in the calibration.

To test if a data point (xi, yi) is an outlier (relative to the regression model) the following outlier test can be applied:

$$\text{Test value} = \frac{\text{residual}_{max}}{RSE\sqrt{1 - \frac{1}{n} - \frac{(Y_i - \bar{y})^2}{(n-1)s_y^2}}}$$

where RSE is the residual standard error, sy is the standard deviation of the Y values, Yi is the y value, n is the number of points, ȳ is the mean of all the y values in the calibration and residualmax is the largest residual value. For example, the test value for the suspected outlier in Figure 3 is 1.78 and the critical value is 2.37 (Table 3, for 10 data points). Although the point appears extreme, it could reasonably be expected to arise by chance within the data set.

    Extrapolation and interpolation

We have already mentioned that the regression line is subject to some uncertainty and that this uncertainty becomes greater at the extremes of the line. If we, therefore, try to extrapolate much beyond the point where we have real data (10%) there may be relatively large errors associated with the predicted value. Conversely, interpolation near the middle of the calibration will minimize the prediction uncertainty. It follows, therefore, that when constructing a calibration graph, the standards should cover a larger range of concentrations than the analyst is interested in. Alternatively, several calibration graphs covering smaller, overlapping, concentration ranges can be constructed.


table 2: Statistics obtained using the Excel 5.0 regression analysis function from the data used to generate the calibration graph in Figure 2.*

            Coefficients    Standard Error   t Stat         p value       Lower 95%      Upper 95%
Intercept   -0.046000012    0.039648848      -1.160185324   0.279423552   -0.137430479   0.045430455
Slope        0.112363638    0.00638999       17.58432015    1.11755E-07    0.097628284    0.127098992

*Note the large number of significant figures. In fact none of the values above warrant more than 3 significant figures!

figure 4: Plots of typical instrument response versus concentration. ((a) Response plotted against concentration; (b) residuals plotted against predicted value.)


    Weighted linear regression and calibration

In analytical science we often find that the precision changes with concentration. In particular, the standard deviation of the data is proportional to the magnitude of the value being measured (see Figure 4(a)). A residuals plot will tend to show this relationship even more clearly (Figure 4(b)). When this relationship is observed (or if the data has been transformed before regression analysis), weighted linear regression should be used for obtaining the calibration curve (3). The following description shows how weighted regression works. Don't be put off by the equations, as most modern statistical software packages will perform the calculations for you; they are only included in the text for completeness.

Weighted regression works by giving points known to have a better precision a higher weighting than those with lower precision. During method validation the way the standard deviation varies with concentration should have been investigated. This relationship can then be used to calculate the initial weightings

$$w_i = \frac{1}{s_i^2}$$

at each of the n concentrations in the calibration. These initial weightings can then be standardized by multiplying by the number of calibration points divided by the sum of all the weights to give the final weights (Wi):

$$W_i = \frac{n\,w_i}{\sum_{j=1}^{n} w_j}$$

The regression model generated will be similar to that for non-weighted linear regression. The prediction confidence intervals will, however, be different. The weighted prediction (xw) for a given instrument reading (y) for the regression model forcing the line through the origin (y = bx) is:

$$X_{(w)predicted} = \frac{\bar{Y}}{b_{(w)}} \qquad \text{with} \qquad b_{(w)} = \frac{\sum_{i=1}^{n} W_i x_i y_i}{\sum_{i=1}^{n} W_i x_i^2}$$

where Ȳ is the mean value of the response (e.g., instrument readings) for m replicates and xi and yi are the data pair for the ith point.

By assuming the regression line goes through the origin a better estimate of the slope is obtained, providing that the assumption of a zero intercept is correct. This may be a reasonable assumption in some instrument calibrations. However, in most cases, the regression line will no longer represent the least-squares best line through the data.
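A minimal sketch of the weighted, through-origin calibration described above; the standard concentrations, responses and per-level standard deviations are hypothetical.

```python
# Weighted linear regression forced through the origin (y = bx)
import numpy as np

x = np.array([0.05, 0.1, 0.2, 0.4, 0.8, 1.6])              # standard concentrations
y = np.array([0.011, 0.022, 0.041, 0.082, 0.168, 0.330])   # responses
s = np.array([0.001, 0.001, 0.002, 0.004, 0.008, 0.017])   # sd at each level

w = 1 / s**2                       # initial weightings
W = len(x) * w / w.sum()           # standardized weights, sum(W) = n

b_w = np.sum(W * x * y) / np.sum(W * x**2)   # weighted slope through the origin
y_unknown = 0.120                            # mean response for an unknown
x_predicted = y_unknown / b_w

print(f"b(w) = {b_w:.4f}, predicted x = {x_predicted:.3f}")
```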

table 3: Outlier test for simple linear least-squares regression.

Sample size (n)    Confidence table value
                   95%      99%
  5                1.74     1.75
  6                1.93     1.98
  7                2.08     2.17
  8                2.20     2.23
  9                2.29     2.44
 10                2.37     2.55
 12                2.49     2.70
 14                2.58     2.82
 16                2.66     2.92
 18                2.72     3.00
 20                2.77     3.06
 25                2.88     3.25
 30                2.96     3.36
 35                3.02     3.40
 40                3.08     3.43
 45                3.12     3.47
 50                3.16     3.51
 60                3.23     3.57
 70                3.29     3.62
 80                3.33     3.68
 90                3.37     3.73
100                3.41     3.78

(Graph: the outlier test values from Table 3 plotted against the number of samples (n) for the 95% and 99% confidence levels.)


References

(1) G.W. Snedecor and W.G. Cochran, Statistical Methods, The Iowa State University Press, USA, 6th edition (1967).
(2) N. Draper and H. Smith, Applied Regression Analysis, John Wiley & Sons Inc., New York, USA, 2nd edition (1981).
(3) BS ISO 11095: Linear Calibration Using Reference Materials (1996).
(4) J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
(5) A.R. Hoshmand, Statistical Methods for Environmental and Agricultural Sciences, 2nd edition, CRC Press (ISBN 0-8493-3152-8) (1998).
(6) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, London, UK (ISBN 0 85404 442 6) (1997).
(7) Statistical Software Qualification: Reference Data Sets, Eds. B.P. Butler, M.G. Cox, S.L.R. Ellison and W.A. Hardcastle, Royal Society of Chemistry, London, UK (ISBN 0-85404-422-1) (1996).
(8) H. Sahai and R.P. Singh, Virginia J. Sci., 40(1), 59 (1989).
(9) F.J. Anscombe, Graphs in Statistical Analysis, American Statistician, 27, 17–21, February 1973.
(10) S. Burke, Scientific Data Management, 1(1), 32–38, September 1997.
(11) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist (Autumn 1995).

Shaun Burke currently works in the Food Technology Department of RHM Technology Ltd, High Wycombe, Buckinghamshire, UK. However, these articles were produced while he was working at LGC, Teddington, Middlesex, UK (http://www.lgc.co.uk).

The associated uncertainty for the weighted prediction, expressed as a confidence interval, is then:

$$X_{(w)predicted} \pm \frac{t \cdot RSE_{(w)}}{b_{(w)}}\sqrt{\frac{1}{m\,W_i} + \frac{\bar{Y}^2}{b_{(w)}^2\sum_{j=1}^{n} W_j x_j^2}}$$

where t is the critical value obtained from t-tables for n−2 degrees of freedom at a stated significance level (typically α = 0.05), Wi is the weighted standard deviation for the x data for the ith point in the calibration, m is the number of replicates, and RSE(w) is the weighted residual standard error for the calibration:

$$RSE_{(w)} = \sqrt{\frac{\sum_{j=1}^{n} W_j y_j^2 - b_{(w)}^2\sum_{j=1}^{n} W_j x_j^2}{n-1}}$$

Conclusions

• Always plot the data. Don't rely on the regression statistics to indicate a linear relationship; for example, the correlation coefficient is not a reliable measure of goodness-of-fit. Always examine the residuals plot, which is a valuable diagnostic tool.
• Remove points of influence (leverage, bias and outlying points) only if a reason can be found for their aberrant behaviour.
• Be aware that a regression line is an estimate of the best line through the data and that there is some uncertainty associated with it. The uncertainty, in the form of a confidence interval, should be reported with the interpolated result obtained from any linear regression calibrations.

Acknowledgement

The preparation of this paper was supported under a contract with the Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (11).


This is the last article in a series of short papers introducing basic statistical methods of use in analytical science. In the three previous papers (1–3) we have assumed the data has been tidy; that is, normally distributed with no anomalous and/or missing results. In the real world, however, we often need to deal with messy data, for example data sets that contain transcription errors, unexpected extreme results or are skewed. How we deal with this type of data is the subject of this article.

    Transcription errors

Transcription errors can normally be corrected by implementing good quality control procedures before statistical analysis is carried out. For example, the data can be independently checked or, more rarely, the data can be entered, again independently, into two separate files and the files compared electronically to highlight any discrepancies. There are also a number of outlier tests that can be used to highlight anomalous values before other statistics are calculated. These tests do not remove the need for good quality assurance; rather they should be seen as an additional quality check.

Missing data

No matter how well our experiments are planned there will always be times when something goes wrong, resulting in gaps in the data. Some statistical procedures will not work as well, or at all, with some data missing. The best recourse is always to repeat the experiment to generate the complete data set. Sometimes, however, this is not feasible, particularly where readings are taken at set times or the cost of retesting is prohibitive, so alternative ways of addressing this problem are needed.

Current statistical software packages typically deal with missing data by one of three methods:

Casewise deletion excludes all examples (cases) that have missing data in at least one of the selected variables. For example, in ICP–AAS (inductively coupled plasma–atomic absorption spectroscopy) calibrated with a number of standard solutions containing several metal ions at different concentrations, if the aluminium value were missing for a particular test portion, all the results for that test portion would be disregarded (see Table 1). This is the usual way of dealing with missing data, but it does not guarantee correct answers. This is particularly so in complex (multivariate) data sets, where it is possible to end up deleting the majority of your data if the missing data are randomly distributed across cases and variables.
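Assuming pandas is available, a minimal sketch of casewise deletion using the Table 1 data (NaN marks the two missing aluminium results) might look like this:

import numpy as np
import pandas as pd

# Table 1 data; NaN marks the two missing aluminium results
df = pd.DataFrame({
    "Al": [np.nan, 567, np.nan, 234],
    "B":  [94.5, 72.1, 34.0, 97.4],
    "Fe": [578, 673, 674, 429],
    "Ni": [23.1, 7.6, 44.7, 82.9],
}, index=["Solution 1", "Solution 2", "Solution 3", "Solution 4"])

# Casewise (listwise) deletion: drop every row containing any missing value
reduced = df.dropna()
print(reduced)    # only Solutions 2 and 4 remain

Dropping whole rows in this way is exactly the default behaviour of most statistical packages, which is why it is worth checking how much data survives before interpreting the results.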

Pairwise deletion can be used as an alternative to casewise deletion in situations where parameters (correlation coefficients, for example) are calculated on successive pairs of variables. For example, in a recovery experiment we may be interested in the correlations between material recovered and extraction time, temperature, particle size, polarity, etc. With pairwise deletion, if one solvent polarity measurement was missing, only this single pair would be deleted from the correlation, and the correlations for recovery versus extraction time and particle size would be unaffected (see Table 2).

Pairwise deletion can, however, lead to serious problems. For example, if there is a hidden systematic distribution of missing points then a bias may result when calculating a correlation matrix (i.e., different correlation coefficients in the matrix can be based on different subsets of cases).
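For illustration, pandas computes correlation matrices with pairwise deletion by default, so the behaviour described above can be reproduced with the Table 2 data (the column names are mine):

import numpy as np
import pandas as pd

# Table 2 data; NaN marks the missing solvent-polarity result
recovery = pd.DataFrame({
    "recovery":        [93, 105, 99, 73],
    "extraction_time": [20, 120, 180, 10],
    "particle_size":   [90, 150, 50, 500],
    "polarity":        [np.nan, 1.8, 1.0, 1.5],
})

# DataFrame.corr() deletes pairwise: each coefficient is based on the rows
# where both variables are present
print(recovery.corr().round(3))   # recovery vs polarity uses 3 pairs, the rest use all 4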

Mean substitution replaces all missing data in a variable by the mean value for that variable.

table 1 Casewise deletion.

Original data:
            Al     B      Fe    Ni
Solution 1  —      94.5   578   23.1
Solution 2  567    72.1   673   7.6
Solution 3  —      34.0   674   44.7
Solution 4  234    97.4   429   82.9

After casewise deletion (statistical analysis only carried out on the reduced data set):
            Al     B      Fe    Ni
Solution 2  567    72.1   673   7.6
Solution 4  234    97.4   429   82.9


Though this looks as if the data set is now complete, mean substitution has its own disadvantages. The variability in the data set is artificially decreased in direct proportion to the number of missing data points, leading to underestimates of dispersion (the spread of the data). Mean substitution may also considerably change the values of some other statistics, such as linear regression statistics (3), particularly where correlations are strong (see Table 3).
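A short sketch of mean substitution, again assuming pandas and using the aluminium column of Tables 1 and 3, also shows why the apparent spread shrinks:

import numpy as np
import pandas as pd

al = pd.Series([np.nan, 567, np.nan, 234], name="Al")   # aluminium column of Table 1

filled = al.fillna(al.mean())     # missing values replaced by the mean, 400.5
print(filled.tolist())            # [400.5, 567.0, 400.5, 234.0]

# The substituted points carry no new information, so the dispersion is
# underestimated compared with the two genuine results
print(al.std(), filled.std())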

Examples of these three approaches are illustrated in Figure 1, for the calculation of a correlation matrix, where the correlation coefficient (r) (3) is determined for each paired combination of the five variables, A to E. Note how the r value can increase, diminish or even reverse sign depending on which method is chosen to handle the missing data (i.e., the A, B correlation coefficients).

Extreme values, stragglers and outliers

Extreme values are defined as "observations in a sample, so far separated in value from the remainder as to suggest that they may be from a different population, or the result of an error in measurement" (6). Extreme values can also be subdivided into stragglers, extreme values detected between the 95% and 99% confidence levels; and outliers, extreme values at >99% confidence level.

It is tempting to remove extreme values automatically from a data set, because they can alter the calculated statistics, e.g., increase the estimate of variance (a measure of spread), or possibly introduce a bias in the calculated mean. There is one golden rule, however: no value should be removed from a data set on statistical grounds alone. Statistical grounds include outlier testing.

Outlier tests tell you, on the basis of some simple assumptions, where you are most likely to have a technical error; they do not tell you that the point is wrong. No matter how extreme a value is in a set of data, the suspect value could nonetheless be a correct piece of information (1). Only with experience or the identification of a particular cause can data be declared "wrong" and removed.

So, given that we understand that the tests only tell us where to look, how do we test for outliers? If we have good grounds for believing our data is normally distributed then a number of outlier tests (sometimes called Q-tests) are available that identify extreme values in an objective way (7,8). Good grounds for believing the data is normal are:

• past experience of similar data
• passing normality tests, for example, the Kolmogorov–Smirnov–Lilliefors test, Shapiro–Wilk test, skewness test or kurtosis test (7,9) — see the sketch below
• plots of the data, e.g., frequency histogram, normal probability plots (1,7).
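Where enough replicates are available, these checks are straightforward to run in software; for example, assuming scipy is installed, the Shapiro–Wilk and Kolmogorov–Smirnov tests can be applied as follows (the data below are invented for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=0.5, size=20)   # invented replicate results

# Shapiro-Wilk: a small p-value is evidence against normality
w, p_sw = stats.shapiro(data)

# Kolmogorov-Smirnov against a normal with the sample mean and standard deviation
# (estimating the parameters from the data makes this approximate; the
# Lilliefors correction addresses that point)
d, p_ks = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))

print(p_sw, p_ks)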

table 2 Pairwise deletion.

          Recovery   Extraction    Particle    Solvent
          (%)        time (min)    size (µm)   polarity (pKa)
Sample 1  93         20            90          —
Sample 2  105        120           150         1.8
Sample 3  99         180           50          1.0
Sample 4  73         10            500         1.5

r (number of data points in the correlation):
Recovery vs extraction time    0.728886 (4)
Recovery vs particle size     −0.87495  (4)
Recovery vs solvent polarity   0.033942 (3)

Pairwise deletion: statistical analysis unaffected except when one of a pair of data points is missing.

table 3 Mean substitution.

Original data:
            Al     B      Fe    Ni
Solution 1  —      94.5   578   23.1
Solution 2  567    72.1   673   7.6
Solution 3  —      34.0   674   44.7
Solution 4  234    97.4   429   82.9

After mean substitution:
            Al     B      Fe    Ni
Solution 1  400.5  94.5   578   23.1
Solution 2  567    72.1   673   7.6
Solution 3  400.5  34.0   674   44.7
Solution 4  234    97.4   429   82.9

Mean substitution: statistical analysis carried out on the pseudo-completed data with no allowance made for errors in the estimated values.

Box 1: Imputation (4,5) is yet another method that is increasingly being used to handle missing data. It is, however, not yet widely available in statistical software packages. In its simplest ad hoc form an imputed value is substituted for the missing value (e.g., mean substitution, already discussed above, is a form of imputation). In its more general/systematic form, however, the imputed missing values are predicted from patterns in the real (non-missing) data. A total of m possible imputed values are calculated for each missing value (using a suitable statistical model derived from the patterns in the data) and then m possible complete data sets are analysed in turn by the selected statistical method. The m intermediate results are then pooled to yield the final result (statistic) and an estimate of its uncertainty. This method works well providing that the missing data is randomly distributed and the model used to predict the imputed values is sensible.
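As a rough illustration only (one statistic, a simple straight-line model and made-up numbers, not the formal procedures of references 4 and 5), the idea behind imputation can be sketched in a few lines of Python:

import numpy as np

rng = np.random.default_rng(1)

# Invented example: y has two missing values (NaN)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, np.nan, 8.2, np.nan, 12.1])

obs = ~np.isnan(y)
slope, intercept = np.polyfit(x[obs], y[obs], 1)                  # model fitted to the real data
resid_sd = np.std(y[obs] - (slope * x[obs] + intercept), ddof=2)

m = 20                                                            # number of imputed data sets
estimates = []
for _ in range(m):
    y_imp = y.copy()
    # predicted value plus random noise, so the uncertainty of the imputation is retained
    y_imp[~obs] = slope * x[~obs] + intercept + rng.normal(0, resid_sd, (~obs).sum())
    estimates.append(y_imp.mean())                                # the statistic of interest

# Pool the m intermediate results (Rubin's rules combine the within- and
# between-imputation variances more rigorously than this simple summary)
print(np.mean(estimates), np.std(estimates, ddof=1))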


Note that the tests used to check normality usually require a significant amount of data (a minimum of 10–15 results are recommended, depending on the normality test applied). For this reason there will be many examples in analytical science where either it will be impractical to carry out such tests, or the tests will not tell us anything meaningful.

If we are not sure the data set is normally distributed then robust statistics and/or non-parametric (distribution independent) tests can be applied to the data. These three approaches (outlier tests, robust estimates and non-parametric methods) are examined in more detail below.

Outlier tests

In analytical chemistry it is rare that we have large numbers of replicate data, and small data sets often show fortuitous grouping and consequent apparent outliers. Outlier tests should, therefore, be used with care and, of course, identified data points should only be removed if a technical reason can be found for their aberrant behaviour.

Most outlier tests look at some measure of the relative distance of a suspect point from the mean value. This measure is then assessed to see if the extreme value could reasonably be expected to have arisen by chance. Most of the tests look for single extreme values (Figure 2(a)), but sometimes it is possible for several outliers to be present in the same data set. These can be identified in one of two ways:

• by iteratively applying the outlier test
• by using tests that look for pairs of extreme values, i.e., outliers that are masking each other (see Figures 2(b) and 2(c)).

Note, as a rule of thumb, if more than 20% of the data are identified as outlying you should start to question your assumption about the data distribution and/or the quality of the data collected. The appropriate outlier tests for the three situations described in Figure 2 are: 2(a) Grubbs 1, Dixon or Nalimov; 2(b) Grubbs 2; and 2(c) Grubbs 3.

We will concentrate on the three Grubbs tests (7). The test values are calculated using the formulae below, after the data are arranged in ascending order.

[figure 2 Outliers and masking: (a) a single outlier at either end of the data; (b) a pair of outliers, one at each end; (c) a pair of outliers at the same end, masking each other.]

figure 1 Effect of missing data on a correlation matrix.

Data excerpt (cases 1–3 and 13–15 of the 15 cases, with column means); in the original figure, shading marks the data removed to show the effects of missing data and the mean values replacing missing data:

Case    A      B      C      D      E
1       105.1  101.7  115.1  101.0  95.2
2       77.0   72.9   77.5   72.7   61.6
3       86.0   82.2   78.9   78.0   91.7
...
13      90.0   77.4   100.8  97.0   111.1
14      90.0   91.3   89.2   81.3   100.5
15      96.9   103.0  97.5   98.5   96.8
mean    99.2   92.4   94.6   89.4   91.7

Correlation matrices with different approaches selected for missing data:

No missing data (15 cases)
      A      B      C      D
B     0.62
C     0.68   0.53
D     0.41   0.47   0.57
E     0.39   0.50   0.59   0.61

Casewise deletion (only 5 cases remain)
      A      B      C      D
B    −0.62
C     0.11  −0.21
D     0.50  −0.36   0.91
E     0.02   0.17   0.71   0.66

Mean substitution (15 cases)
      A      B      C      D
B     0.01
C    −0.05   0.40
D     0.02   0.47   0.47
E     0.36   0.25   0.43   0.46

Pairwise deletion (variable number of cases; number of data points in parentheses)
      A          B          C          D
B     0.54 (12)
C     0.55 (12)  0.50 (11)
D     0.27 (12)  0.47 (11)  0.79 (11)
E     0.23 (11)  0.77 (10)  0.70 (10)  0.71 (10)

Note: at the 95% confidence level, the critical values of r are: n = 15, 0.514; n = 12, 0.576; n = 11, 0.602; n = 10, 0.632; n = 5, 0.950.

G1 = |x̄ − xi| / s          G2 = (xn − x1) / s          G3 = 1 − [(n − 3) × s²(n−2)] / [(n − 1) × s²]

where s is the standard deviation for the whole data set; xi is the suspected single outlier, i.e., the value furthest away from the mean; | | is the modulus, the value of a calculation ignoring the sign of the result; x̄ is the mean; n is the number of data points; xn and x1 are the most extreme values; and s(n−2) is the standard deviation for the data set excluding the suspected pair of outlier values, i.e., the pair of values furthest away from the mean.

If the test values (G1, G2, G3) are greater than the critical value obtained from tables (see Table 4) then the extreme value(s) are unlikely to have occurred by chance at the stated confidence level (see Box 2).

Pitfalls of outlier tests

Figure 3 shows three situations where outlier tests can misleadingly identify an extreme value. Figure 3(a) shows a situation common in chemical analysis. Because of limited measurement precision (rounding errors) it is possible to end up comparing a result which, no matter how close it is to the other values, is an infinite number of standard deviations away from the mean of the remaining results. This value will therefore always be flagged as an outlier. In Figure 3(b) there is a genuine long tail on the distribution that may cause successive outlying points to be identified. This type of distribution is surprisingly common in some types of chemical analysis, e.g., pesticide residues. If there is very little data (Figure 3(c)) an outlier can be identified by chance. In this situation it is possible that the identified point is closer to the true value and it is the other values that are the outliers. This occurs more often than we would like to admit; how many times do your procedures state "average the best two out of three determinations"?

Outliers by variance

When the data are from different groups (for example when comparing test methods via interlaboratory comparison) it is not only possible for individual points within a group to be outlying but also for the group means to have outliers with respect to each other. Another type of outlier that can occur is when the spread of data within one particular group is unusually small or large when compared with the spread of the other groups (see Figure 4).

The same Grubbs tests that are used to determine the presence of within-group outlying replicates may also be used to test for suspected outlying means. The Cochran's test can be used to test for the third case, that of a suspected outlying variance. To carry out the Cochran's test, the suspect variance is compared with the sum of all group variances. (The variance is a measure of spread and is simply the square of the standard deviation (1).)

Cn = s²(suspected) / Σ(i=1..g) si²        where g is the number of groups and n = [Σ(i=1..g) ni] / g

If this calculated ratio, Cn, exceeds the critical value obtained from statistical tables (7) then the suspect group spread is extreme. The choice of n is the average number of all sample results produced by all groups. The Cochran's test assumes the numbers of replicates within the groups are the same or at least similar (±1). It also assumes that none of the data have been rounded and there are sufficient numbers of replicates to get a reasonable estimate of the variance. The Cochran's test should not be used iteratively as this could lead to a large percentage of data being removed (see Box 3).

Robust statistics

Robust statistics include methods that are largely unaffected by the presence of extreme values. The most commonly used of these statistics are as follows:

Median: The median is a measure of central tendency (1) and can be used instead of the mean. To calculate the median (x̃) the data are arranged in order of magnitude and the median is then the central member of the series (or the mean of the two central members when there is an even number of data), i.e., there are equal numbers of observations smaller and greater than the median. For a symmetrical distribution the mean and median have the same value.

x̃ = xm                    when n is odd (1, 3, 5, …)
x̃ = (xm + xm+1) / 2        when n is even (2, 4, 6, …)
where m = round up (n/2)

Median Absolute Deviation (MAD): The MAD value is an estimate of the spread in the data similar to the standard deviation.

For n values, MAD = median |xi − x̃|,  i = 1, 2, …, n

Box 2: Grubbs tests (worked example).

13 replicates are ordered in ascending order (x1 … xn):
47.876 47.997 48.065 48.118 48.151 48.211 48.251 48.559 48.634 48.711 49.005 49.166 49.484

n = 13, mean = 48.479, s = 0.498, s²(n−2) = 0.123

G1 = (49.484 − 48.479) / 0.498 = 2.02
G2 = (49.484 − 47.876) / 0.498 = 3.23
G3 = 1 − (10 × 0.123) / (12 × 0.498²) = 0.587

Grubbs critical values for 13 values are G(1) = 2.331 and 2.607, G(2) = 4.00 and 4.24, G(3) = 0.6705 and 0.7667 for the 95% and 99% confidence levels. Since the test values are less than their respective critical values, in all cases, it can be concluded there are no outlying values.
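The calculations in Box 2 are easy to reproduce; a small Python sketch of the three Grubbs statistics (the critical values must still be looked up in Table 4) might be:

import numpy as np

def grubbs_statistics(values):
    # Return G1, G2 and G3 for a set of replicates (compare with Table 4 critical values)
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    mean, s = x.mean(), x.std(ddof=1)

    # G1: single suspect value, the point furthest from the mean
    suspect = x[-1] if (x[-1] - mean) > (mean - x[0]) else x[0]
    g1 = abs(mean - suspect) / s

    # G2: the two most extreme values, one at each end of the data
    g2 = (x[-1] - x[0]) / s

    # G3: a pair of suspect values at the same end; s_n2 is the standard
    # deviation with that pair excluded (the end furthest from the mean)
    pair_high = (x[-1] - mean) > (mean - x[0])
    s_n2 = (x[:-2] if pair_high else x[2:]).std(ddof=1)
    g3 = 1.0 - ((n - 3) * s_n2**2) / ((n - 1) * s**2)
    return g1, g2, g3

# Box 2 data (13 replicates)
data = [47.876, 47.997, 48.065, 48.118, 48.151, 48.211, 48.251,
        48.559, 48.634, 48.711, 49.005, 49.166, 49.484]
print([round(g, 2) for g in grubbs_statistics(data)])   # approximately 2.02, 3.23, 0.59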


If the MAD value is scaled by a factor of 1.483 it becomes comparable with a standard deviation; this is the MADE value:

MADE = 1.483 × MAD
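Assuming numpy is available, a short sketch of these robust estimates on an invented data set with one wild value:

import numpy as np

data = np.array([10.1, 10.3, 9.9, 10.2, 10.0, 14.8])   # invented results; one suspiciously high value

median = np.median(data)
mad = np.median(np.abs(data - median))                  # median absolute deviation
made = 1.483 * mad                                      # scaled to be comparable with a standard deviation

print(median, mad, made)                                # 10.15, 0.15, ~0.22
print(data.mean(), data.std(ddof=1))                    # ~10.88 and ~1.92: both pulled up by the extreme value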

Other robust statistical estimates include trimmed means and deviations, Winsorized means and deviations, least median of squares (robust regression), Levene's test (heterogeneity in ANOVA), etc. A discussion of robust statistics in analytical chemistry can be found elsewhere (10, 11).

Non-parametric tests

Typical statistical tests incorporate assumptions about the underlying distribution of data (such as normality), and hence rely on distribution parameters. Non-parametric tests are so called because they make few or no assumptions about the distributions, and do not rely on distribution parameters. Their chief advantage is improved reliability when the distribution is unknown. There is at least one non-parametric equivalent for each parametric type of test (see Table 5). In a short article, such as this, it is impossible to describe the methodology for all these tests but more information can be found in other publications (12, 13).
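As an illustration, scipy implements several of the tests listed in Table 5; assuming it is available, a Mann–Whitney U comparison of two small groups of made-up results looks like this:

from scipy import stats

# Two independent groups of invented results
method_a = [10.2, 10.5, 9.9, 10.4, 10.1]
method_b = [10.9, 11.2, 10.8, 11.0, 11.4]

# Mann-Whitney U test: a non-parametric alternative to the two-sample t-test
u, p = stats.mannwhitneyu(method_a, method_b, alternative="two-sided")
print(u, p)    # a small p-value suggests the two groups differ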

Conclusions

• Always check your data for transcription errors. Outlier tests can help to identify them as part of a quality control check.
• Delete extreme values only when a technical reason for their aberrant behaviour can be found.
• Missing data can result in misinterpretation of the resulting statistics, so care should be taken with the method chosen to handle the gaps. If at all possible, further experiments should be carried out to fill in the missing points.


table 4 Grubbs critical value table (5).

        95% confidence level          99% confidence level
n       G(1)    G(2)    G(3)          G(1)    G(2)    G(3)
3       1.153   2.00    ---           1.155   2.00    ---
4       1.463   2.43    0.9992        1.492   2.44    1.0000
5       1.672   2.75    0.9817        1.749   2.80    0.9965
6       1.822   3.01    0.9436        1.944   3.10    0.9814
7       1.938   3.22    0.8980        2.097   3.34    0.9560
8       2.032   3.40    0.8522        2.221   3.54    0.9250
9       2.110   3.55    0.8091        2.323   3.72    0.8918
10      2.176   3.68    0.7695        2.410   3.88    0.8586
12      2.285   3.91    0.7004        2.550   4.13    0.7957
13      2.331   4.00    0.6705        2.607   4.24    0.7667
15      2.409   4.17    0.6182        2.705   4.43    0.7141
20      2.557   4.49    0.5196        2.884   4.79    0.6091
25      2.663   4.73    0.4505        3.009   5.03    0.5320
30      2.745   4.89    0.3992        3.103   5.19    0.4732
35      2.811   5.026   0.3595        3.178   5.326   0.4270
40      2.866   5.150   0.3276        3.240   5.450   0.3896
50      2.956   5.350   0.2797        3.336   5.650   0.3328
60      3.025   5.500   0.2450        3.411   5.800   0.2914
70      3.082   5.638   0.2187        3.471   5.938   0.2599
80      3.130   5.730   0.1979        3.521   6.030   0.2350
90      3.171   5.820   0.1810        3.563   6.120   0.2147
100     3.207   5.900   0.1671        3.600   6.200   0.1980
110     3.239   5.968   0.1553        3.632   6.268   0.1838
120     3.267   6.030   0.1452        3.662   6.330   0.1716
130     3.294   6.086   0.1364        3.688   6.386   0.1611
140     3.318   6.137   0.1288        3.712   6.437   0.1519

Box 3: Cochran's test (worked example).

An interlaboratory study was carried out by 13 laboratories to determine the amount of cotton in a cotton/polyester fabric; 85 determinations were carried out in total. The standard deviations of the data obtained by each of the 13 laboratories were as follows:

Std dev.: 0.202 0.402 0.332 0.236 0.318 0.452 0.210 0.074 0.525 0.067 0.609 0.246 0.198

n = 85 / 13 = 6.54 ≈ 7

Cn = 0.609² / (0.202² + 0.402² + … + 0.246² + 0.198²) = 0.371 / 1.474 = 0.252

Cochran's critical value for n = 7 and g = 13 is 0.23 at the 95% confidence level (7). As the test value is greater than the critical value, it can be concluded that the laboratory with the highest standard deviation (0.609) has an outlying spread of replicates and this laboratory's results therefore need to be investigated further. It is normal practice in interlaboratory comparisons not to test for low variance outliers, i.e., laboratories reporting unusually precise results.
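The arithmetic in Box 3 can be reproduced in a few lines (the critical value itself still comes from statistical tables):

import numpy as np

# Standard deviations reported by the 13 laboratories (Box 3)
sds = np.array([0.202, 0.402, 0.332, 0.236, 0.318, 0.452, 0.210,
                0.074, 0.525, 0.067, 0.609, 0.246, 0.198])

variances = sds**2
c = variances.max() / variances.sum()   # Cochran test statistic for the largest variance
n_bar = 85 / len(sds)                   # average number of replicates per laboratory

print(round(c, 3), round(n_bar, 2))     # 0.252 and 6.54 (so use n = 7 in the tables)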


• Outlier tests assume the data distribution is known. This assumption should be checked for validity before these tests are applied.
• Robust statistics avoid the need to use outlier tests by down-weighting the effect of extreme values.
• When knowledge about the underlying data distribution is limited, non-parametric methods should be used.

NB: It should be noted that, following a judgement in a US court, the Food and Drug Administration (FDA), in a guide "Guide to inspection of pharmaceutical quality control laboratories", has specifically prohibited the use of outlier tests.

Acknowledgement

The preparation of this paper was supported under a contract with the UK's Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (14).

References

(1) S. Burke, Scientific Data Management, 1(1), 32–38, 1997.
(2) S. Burke, Scientific Data Management, 2(1), 36–41, 1998.
(3) S. Burke, Scientific Data Management, 2(2), 32–40, 1998.
(4) J.L. Schafer, Monographs on Statistics and Applied Probability 72: Analysis of Incomplete Multivariate Data, Chapman & Hall (1997), ISBN 0-412-04061-1.
(5) R.J.A. Little and D.B. Rubin, Statistical Analysis With Missing Data, John Wiley & Sons (1987), ISBN 0-471-80243-9.
(6) ISO 3534, Statistics – Vocabulary and Symbols, Part 1: Probability and general statistical terms, section 2.64, Geneva, 1993.
(7) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide, Royal Society of Chemistry, 1997 (ISBN 0-85404-442-6).
(8) V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd edition, John Wiley (1994).
(9) W.H. Kruskal and J.M. Tanur, International Encyclopaedia of Statistics, Collier Macmillan Publishers, 1978, ISBN 0-02-917960-2.
(10) Analytical Methods Committee, "Robust Statistics – How Not to Reject Outliers, Part 2", Analyst, 114, 1693–1697, 1989.
(11) D.C. Hoaglin, F. Mosteller and J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley & Sons (1983), ISBN 0-471-09777-2.
(12) M. Hollander and D.A. Wolfe, Nonparametric Statistical Methods, Wiley & Sons, New York, 1973.
(13) W.W. Daniel, Applied Nonparametric Statistics, Houghton Mifflin, Boston, 1978.
(14) M. Sargent, VAM Bulletin, Issue 13, 4–5, Laboratory of the Government Chemist (Autumn 1995).

[figure 4 Different types of outlier in grouped data: a box-and-whisker plot of analyte concentration (approximately 13–22) against laboratory ID (1–16), in which one laboratory shows an outlying variance and another an outlying mean.]

table 5 Parametric methods and their non-parametric equivalents.

Types of comparison             Parametric methods                Non-parametric methods (12, 13)
Differences between             t-test for independent            Wald–Wolfowitz runs test;
independent groups of data      groups (2);                       Mann–Whitney U test;
                                ANOVA/MANOVA (2)                  Kolmogorov–Smirnov two-sample test;
                                                                  Kruskal–Wallis analysis of ranks;
                                                                  Median test
Differences between             t-test for dependent              Sign test;
dependent groups of data        groups (2);                       Wilcoxon's matched pairs test;
                                ANOVA with replication (2)        McNemar's test;
                                                                  χ² (Chi-square) test;
                                                                  Friedman's two-way ANOVA;
                                                                  Cochran Q test
Relationships between           Linear regression (3);            Spearman R;
continuous variables            Correlation coefficient (3)       Kendall Tau
Homogeneity of variance         Bartlett's test (7)               Levene's test; Brown & Forsythe
Relationships between           χ² (Chi-square) test              Gamma coefficient;
counted variables                                                 Phi coefficient;
                                                                  Fisher exact test;
                                                                  Kendall coefficient of concordance