
Notes on Tukey's One Degree of Freedom Test for Interaction

Suppose you have a 2-way analysis of variance problem with a single observation per cell. Let the factors be A and B (corresponding to rows and columns of your table of cells, respectively), and let A have $a$ levels and B have $b$ levels. (I.e. your table has $a$ rows and $b$ columns.)

Let $Y_{ij}$ be the (single) observation corresponding to cell $(i, j)$ (i.e. row $i$ and column $j$). The most elaborate model that you can fit is (in over-parameterized form):

Y_{ij} = \mu + \alpha_i + \beta_j + E_{ij}

That is, you cannot fit the full model, which includes an interaction term, because you run out of degrees of freedom. The full model, which would be

Y_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + E_{ij}

would use up $1 + (a-1) + (b-1) + (a-1)(b-1) = 1 + a - 1 + b - 1 + ab - a - b + 1 = ab$ degrees of freedom. Of course your sample size is $n = ab$, so this leaves 0 degrees of freedom for error.

(In some sense you can fit the model, but it fits exactly; your estimate of $\sigma^2$ is 0. It's just like fitting a straight line to precisely 2 data points.)

Thus you can't fit an interaction term and so you can't test for interaction. Fitting the (additive) model to such a data set involves tacitly assuming that there is no interaction. Such an assumption may be rash.

Clever old John Tukey [6] didn't like this situation much and (being Tukey) he was able to figure out a way to get around the dilemma, at least partially. That is, Tukey figured out a way to test for interaction when you can't test for interaction. It's only a partial solution because it only tests for a particular form of interaction, but it's a lot better than nothing. Tukey described his procedure as using "one degree of freedom for non-additivity".

    The clearest way to describe the model proposed by Tukey is to formulate it as

Y_{ij} = \mu + \alpha_i + \beta_j + \gamma\,\alpha_i\beta_j + E_{ij}    (1)

where the $\alpha_i$ and the $\beta_j$ have their usual meanings and $\gamma$ is one more parameter to be estimated (hence using up that 1 degree of freedom).

This was not (according to [3, page 7]) the way that Tukey originally formulated his proposed solution to the testing-for-interaction problem. Other authors ([7, 5, 1, 2]) showed that Tukey's procedure is equivalent to fitting model (1).

Now model (1) is a non-linear model, due to those products of parameters $\gamma\alpha_i\beta_j$ which appear. So you might think that fitting it would require some extra heavy-duty software. Wrong-oh! Despite the non-linearity, the model can be fitted using linear model fitting software, e.g. the GLM procedure in Minitab. The fitting must be done in two steps, and it's a bit kludgy in Minitab, but for simple problems at least it's not too hard. The steps are as follows:


1. Fit the additive model

   Y_{ij} = \mu + \alpha_i + \beta_j + E_{ij}

   Then use the estimates of $\alpha_i$ and $\beta_j$ to form a new variable equal to the product of these estimates, i.e. set

   z_{ij} = \hat{\alpha}_i\,\hat{\beta}_j

2. Fit the model

   Y_{ij} = \mu + \alpha_i + \beta_j + \gamma z_{ij} + E_{ij}

   It is important to remember that z is a covariate. The test for the significance of z forms the test for interaction. (A minimal sketch of this two-step procedure in Python follows below.)
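For readers who want to check the arithmetic outside Minitab, here is a minimal sketch of the same two-step procedure in Python, using only numpy. The function name tukey_one_df_test, and the convention that the data arrive as an a x b array Y with one observation per cell, are illustrative assumptions and not part of the original notes; this is a sketch of the technique, not a polished implementation.

import numpy as np

def tukey_one_df_test(Y):
    """Tukey's one-degree-of-freedom test for non-additivity.

    Y is an a x b array with a single observation per cell.  Returns the
    estimate of gamma, the F statistic, and its error degrees of freedom
    (the test is on 1 and (a-1)(b-1) - 1 degrees of freedom).
    """
    a, b = Y.shape
    mu = Y.mean()
    alpha = Y.mean(axis=1) - mu            # row effect estimates (alpha_i hat)
    beta = Y.mean(axis=0) - mu             # column effect estimates (beta_j hat)

    # Step 1: form the covariate z_ij = alpha_i_hat * beta_j_hat.
    z = np.outer(alpha, beta)

    # Error SS of the additive model.
    fitted_add = mu + alpha[:, None] + beta[None, :]
    sse_add = ((Y - fitted_add) ** 2).sum()

    # Step 2: because z is orthogonal to the additive part of the design,
    # the coefficient of z in the augmented model reduces to a simple
    # regression coefficient, and its SS uses one degree of freedom.
    gamma_hat = (z * Y).sum() / (z ** 2).sum()
    ss_nonadd = gamma_hat ** 2 * (z ** 2).sum()

    df_error = (a - 1) * (b - 1) - 1
    F = ss_nonadd / ((sse_add - ss_nonadd) / df_error)
    return gamma_hat, F, df_error

The two examples below each include a short numerical check against this sketch.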

The kludgy bit in Minitab arises in the process of constructing the $z_{ij}$ (and making sure they get put into their column in the right order). The rest of the procedure is straightforward. Examples follow.

    Example 1: (Reproduces example 1.6.3, page 10 in [3].)

An experiment was conducted to investigate the effect of temperature and humidity on the growth rate of sorghum plants. Ten plants of the same species were grown in each of 20 growth chambers; 5 temperature levels and 4 humidity levels were used. The 20 temperature-humidity combinations were randomly assigned to the 20 chambers. The heights of the plants were measured after four weeks. The experimental unit was taken to be the growth chamber, so the response was taken to be the average height (in cm) of the plants in each chamber.

    MTB > print c1-c3

    Row ht hum tmpr

    1 12.3 20 50

    2 19.6 40 50

    3 25.7 60 50

    4 30.4 80 50

    5 13.7 20 60

    6 16.9 40 60

    7 27.0 60 60

    8 31.5 80 60

9 17.8 20 70

10 20.0 40 70

    11 26.3 60 70

    12 35.9 80 70

    13 12.1 20 80

    14 17.4 40 80

    15 36.9 60 80

    16 43.4 80 80

    17 6.9 20 90

    18 18.8 40 90

    19 35.0 60 90

    20 53.0 80 90


    MTB > anova ht = hum tmpr;

    SUBC> means hum tmpr.

    Factor Type Levels Values

    hum fixed 4 20, 40, 60, 80

    tmpr fixed 5 50, 60, 70, 80, 90

    Source DF SS MS F P

    hum 3 2074.30 691.43 20.72 0.000

    tmpr 4 136.62 34.15 1.02 0.434

    Error 12 400.45 33.37

    Total 19 2611.36

    Means

    hum N ht

    20 5 12.560

    40 5 18.540

    60 5 30.180

    80 5 38.840

    tmpr N ht

    50 4 22.000

    60 4 22.275

    70 4 25.000

    80 4 27.450

    90 4 28.425

MTB > # Here's the kludgy bit. Sigh.

    MTB > set c4

    DATA> 5(12.56 18.54 30.18 38.84)

    DATA> end

    MTB > name c4 ahat

    MTB > set c5

    DATA> (22 22.275 25 27.45 28.425)4

    DATA> end

    MTB > name c5 bhat

    MTB > let c6 = (c4-mean(c1))*(c5-mean(c1))

MTB > # Note that the numbers we set into c4 and c5 were the estimated
MTB > # row and column *means*, which are estimates of mu + alpha_i

    MTB > # and mu + beta_j respectively. To get estimates of alpha_i

    MTB > # and beta_j we need to subtract the estimate of mu, whence

    MTB > # the -mean(c1) business in the expression for c6.

    MTB > name c6 z

    MTB > glm ht = hum tmpr z;

    SUBC> cova z.

    Factor Type Levels Values

    hum fixed 4 20, 40, 60, 80

    tmpr fixed 5 50, 60, 70, 80, 90


    Analysis of Variance for ht, using Adjusted SS for Tests

    Source DF Seq SS Adj SS Adj MS F P

    hum 3 2074.30 2074.30 691.43 68.03 0.000

    tmpr 4 136.62 136.62 34.15 3.36 0.050

    z 1 288.65 288.65 288.65 28.40 0.000

    Error 11 111.79 111.79 10.16

    Total 19 2611.36

    Term Coef SE Coef T P

    Constant 25.0300 0.7129 35.11 0.000

    z 0.14273 0.02678 5.33 0.000

From the above we see that the estimate of $\gamma$ is $\hat{\gamma} = 0.1427$ (to 4 decimal places). The F statistic for the significance of z is 28.40; it has a p-value of 0 (to 3 decimal places), whence we can conclude with virtual certainty that there is interaction between humidity and temperature, something we could not have checked up on without Tukey's help.
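As a cross-check, the Python sketch given earlier reproduces these GLM results. The layout of Y below (rows are the humidity levels 20/40/60/80, columns are the temperature levels 50/60/70/80/90) is an assumption made purely for the illustration; the values are those printed at the start of the example.

Y = np.array([
    [12.3, 13.7, 17.8, 12.1,  6.9],   # hum = 20
    [19.6, 16.9, 20.0, 17.4, 18.8],   # hum = 40
    [25.7, 27.0, 26.3, 36.9, 35.0],   # hum = 60
    [30.4, 31.5, 35.9, 43.4, 53.0],   # hum = 80
])
gamma_hat, F, df_error = tukey_one_df_test(Y)
# Expect gamma_hat near 0.1427 and F near 28.4 on 1 and 11 degrees of
# freedom, matching the GLM output above.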

Note that if we are careful, we can actually do the problem without correcting for the overall mean when forming z. This goes as follows:

    MTB > let c6 = c4*c5 # No mean correction.

    MTB > GLM ht = hum + tmpr + z;

    SUBC> cova z;

    SUBC> ssqu 1. # This says use sequential SS.

    Factor Type Levels Values

    hum fixed 4 20, 40, 60, 80

    tmpr fixed 5 50, 60, 70, 80, 90

    Source DF Seq SS Adj SS Seq MS F P

    hum 3 2074.30 148.06 691.43 68.03 0.000

    tmpr 4 136.62 128.41 34.15 3.36 0.050

    z 1 288.65 288.65 288.65 28.40 0.000

    Error 11 111.79 111.79 10.16

    Total 19 2611.36

Term Coef SE Coef T P

Constant -64.39 16.79 -3.83 0.003

    z 0.14273 0.02678 5.33 0.000

The results are the same as before; at least, those in which we are interested are the same! (The Constant term estimate is quite different.)

Where we had to be careful in the foregoing was in telling Minitab to use sequential sums of squares, rather than the (default) adjusted sums of squares. If we had left in the default and not mean-corrected, the answer would've been wrong.
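As a brief aside (not in the original notes, using the notation of model (1)): the reason the uncorrected covariate works with sequential sums of squares is that it differs from the mean-corrected one only by terms the additive model already spans. Writing $\bar{Y}_{i\cdot}$ and $\bar{Y}_{\cdot j}$ for the row and column means,

z'_{ij} = \bar{Y}_{i\cdot}\,\bar{Y}_{\cdot j} = (\hat{\mu} + \hat{\alpha}_i)(\hat{\mu} + \hat{\beta}_j) = \hat{\mu}^2 + \hat{\mu}\hat{\alpha}_i + \hat{\mu}\hat{\beta}_j + \hat{\alpha}_i\hat{\beta}_j ,

so z' equals the mean-corrected covariate z plus pieces lying in the column space of the additive model. When z' is entered last, its sequential SS is computed after adjusting for the constant, rows, and columns, which strips those pieces out and leaves exactly the SS for z. The adjusted SS for hum and tmpr, on the other hand, are adjusted for z' and therefore change (as the Adj SS column above shows).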


The same result can be achieved by regressing y on z with no constant term in the model, i.e. by fitting the model $Y_{ij} = \gamma z_{ij} + E_{ij}$.

    MTB > regr c1 1 c6;

    SUBC> noco.

    The regression equation is

    ht = 0.143 z

    Predictor Coef SE Coef T P

    Noconstant

    z 0.1427 0.2349 0.61 0.551

    Analysis of Variance

    Source DF SS MS F P

    Regression 1 288.7 288.7 0.37 0.551

    Residual Error 19 14852.7 781.7

    Total 20 15141.4

Note that we get the same estimate of $\gamma$, namely 0.1427. To do the test for the significance of z requires more work, however. We need to form

F = \frac{\mathrm{SSR}_{\mathrm{reg}}/1}{(\mathrm{SSE}_{\mathrm{add}} - \mathrm{SSR}_{\mathrm{reg}})/((a-1)(b-1) - 1)} = \frac{287.5}{(400.45 - 287.5)/11} = 28.0

(compare with 28.4 from the GLM method). Under the null hypothesis of no interaction this has an F distribution on 1 and 11 degrees of freedom, so the observed value of 28 yields a p-value of 0.000.
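As a quick numeric check of this formula (an illustrative aside: it uses the unrounded regression SS of 288.65 from the GLM output rather than the rounded 287.5 in the hand calculation above, and the variable names are arbitrary):

a, b = 4, 5                         # levels of hum and tmpr
ssr_reg, sse_add = 288.65, 400.45   # SS for z and additive-model error SS, from the outputs above
df_error = (a - 1) * (b - 1) - 1    # = 11
F = (ssr_reg / 1) / ((sse_add - ssr_reg) / df_error)
# F comes out at about 28.4, agreeing with the GLM analysis.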

Example 2: (Taken from [4, problem 22, Chapter 10, page 441].)

Three different washing machines were used to test four different detergents. The response was a coded score of the effectiveness of each washing.

    MTB > print c1-c3

Row effect mach dete

1 53 mach.1 dete.1

    2 50 mach.2 dete.1

    3 59 mach.3 dete.1

    4 54 mach.1 dete.2

    5 54 mach.2 dete.2

    6 60 mach.3 dete.2

    7 56 mach.1 dete.3

    8 58 mach.2 dete.3

    9 62 mach.3 dete.3

    10 50 mach.1 dete.4

    11 45 mach.2 dete.4

    12 57 mach.3 dete.4


MTB > anova effect = mach dete;

    SUBC> means mach dete.

    Factor Type Levels Values

    mach fixed 3 mach.1, mach.2, mach.3

    dete fixed 4 dete.1, dete.2, dete.3, dete.4

    Source DF SS MS F P

    mach 2 135.167 67.583 18.29 0.003

    dete 3 102.333 34.111 9.23 0.011

    Error 6 22.167 3.694

    Total 11 259.667

    Means

    mach N effect

    mach.1 4 53.250

    mach.2 4 51.750

    mach.3 4 59.500

    dete N effect

    dete.1 3 54.000

    dete.2 3 56.000

    dete.3 3 58.667

    dete.4 3 50.667

    MTB > set c4

    DATA> 4(53.25 51.75 59.5)

    DATA> end

    MTB > name c4 ahat

    MTB > set c5

    DATA> (54 56 58.667 50.667)3

    DATA> end

    MTB > name c5 bhat

    MTB > let c6 = c4*c5

    MTB > name c6 z

MTB > glm effect = mach dete z;
SUBC> cova z;

    SUBC> ssqu 1.

    Factor Type Levels Values

    mach fixed 3 mach.1, mach.2, mach.3

    dete fixed 4 dete.1, dete.2, dete.3, dete.4


    Analysis of Variance for effect, using Sequential SS for Tests

    Source DF Seq SS Adj SS Seq MS F P

    mach 2 135.167 16.012 67.583 31.62 0.001

    dete 3 102.333 15.998 34.111 15.96 0.005

    z 1 11.479 11.479 11.479 5.37 0.068

    Error 5 10.688 10.688 2.138

    Total 11 259.667

    Term Coef SE Coef T P

    Constant 354.9 129.5 2.74 0.041

    z -0.09979 0.04306 -2.32 0.068

Note that the foregoing example was done without mean correcting, but rather by using sequential sums of squares.

In this example we get a hint of interaction (p-value = 0.068), but the term is not significant at the 0.05 level, only at the slack 0.10 level.

Note that we can't actually say that there is a lack of evidence of interaction, only that there is a lack of evidence of this sort (the Tukey sort) of interaction. There could be other sorts of interaction, which we cannot detect by this method, lurking in the weeds.
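For completeness, the Python sketch from earlier reproduces this example too. The 3 x 4 layout below (rows are machines, columns are detergents) is again an assumption made only for the illustration; the values are those printed at the start of the example.

Y = np.array([
    [53.0, 54.0, 56.0, 50.0],   # mach.1
    [50.0, 54.0, 58.0, 45.0],   # mach.2
    [59.0, 60.0, 62.0, 57.0],   # mach.3
])
gamma_hat, F, df_error = tukey_one_df_test(Y)
# Expect gamma_hat near -0.0998 and F near 5.37 on 1 and 5 degrees of
# freedom, matching the GLM output above (p-value about 0.068).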

    References

[1] F. A. Graybill. An Introduction to Linear Statistical Models, Volume 1. McGraw-Hill, New York, 1961.

[2] F. A. Graybill. Theory and Application of the Linear Model. Duxbury, North Scituate, Mass., 1976.

[3] George A. Milliken and Dallas E. Johnson. Analysis of Messy Data, Volume 2: Nonreplicated Experiments. Van Nostrand Reinhold, New York, 1989.

[4] Sheldon Ross. Introduction to Probability and Statistics for Engineers and Scientists. Harcourt Academic Press, San Diego, second edition, 2000.

[5] Henry Scheffé. The Analysis of Variance. John Wiley, New York, 1959.

[6] John W. Tukey. One degree of freedom for non-additivity. Biometrics, 5:232–242, 1949.

[7] G. C. Ward and I. D. Dick. Non-additivity in randomized block designs and balanced incomplete block designs. New Zealand Journal of Science and Technology, 33:430–435, 1952.
