Sampling, Regression, Experimental Design and Analysis …people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/ChapterANCOVA.pdf · Sampling, Regression, Experimental Design and Analysis

Sampling, Regression, Experimental Design andAnalysis for Environmental Scientists,

Biologists, and Resource Managers

C. J. SchwarzDepartment of Statistics and Actuarial Science, Simon Fraser University

[email protected]

January 16, 2011

Contents

101Analysis of Covariance - ANCOVA 2101.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2101.2Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6101.3Comparing individual regression lines . . . . . . . . . . . . . . . 6101.4Comparing Means after covariate adjustments . . . . . . . . . . . 10101.5Power and sample size . . . . . . . . . . . . . . . . . . . . . . . . 11101.6Example - Degradation of dioxin . . . . . . . . . . . . . . . . . . 11101.7Change in yearly average temperature with regime shifts . . . . . 27101.8Example - More refined analysis of stream-slope example . . . . 35101.9Comparing Fulton’s Condition Factor K . . . . . . . . . . . . . . 48101.10Final Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

1

Chapter 101

Analysis of Covariance -ANCOVA

101.1 Introduction

In previous chapters, we looked at comparing group means from data collectedfrom a single-factor completely randomized design and analyzed using ANOVA.We also looked at estimating the slope of a straight line between two variables.In both cases the response variable, Y , was continuous (interval or ratio scale).In the case of ANOVA, the X variables was nominal or ordinal in scale andserved to identify the treatment groups. In the regression setting, the X variablewas also continuous.

The Analysis of Covariance (ANCOVA) is a combination of both analyses.Groups are identified by a nominal or ordinal scale variable and a continuouscovariate is also measured.

There are two uses of ANCOVA which, on the surface, appear to be separateanalyses. In fact, both analyses are identical.

The first use is to check if the regression line for the groups are parallel.If there is evidence that the individual regression lines are not parallel, then aseparate regression line must be fit for each group for prediction purposes. Ifthere is no evidence of non-parallelism, then the next task is to see if the linesare co-incident, i.e. have both the same intercept and the same slope. If thereis evidence that the lines are not coincident, then a series of parallel lines are fitto the data. All of the data are used to estimate the common slope. If there is

2

CHAPTER 101. ANALYSIS OF COVARIANCE - ANCOVA

no evidence that the lines are not coincident, then all of the data can be simplypooled together and a single regression line fit for all of the data.

The three possibilities are shown below for the case of two groups - theextension to many groups is obvious:

c©2010 Carl James Schwarz 3




Second, ANCOVA has been used to test for differences in means among thegroups when some of the variation in the responsible variable can be “explained”by a covariate. For example, the effectiveness of two different diets can becompared by randomizing people to the two diets and measuring the weightchange during the experiment. However, some of the variation in weight changemay be related to initial weight. Perhaps by “standardizing” everyone to somecommon weight, we can more easily detect differences among the groups.

Insert graphs hereA very nice book on the Analysis of Covariance is Analysis of Messy Data,

Volume III: Analysis of Covariance by G. A. Milliken and D. E. Johnson. De-tails are available at http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869.


http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869

http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869


101.2 Assumptions

As before, it is important before the analysis is started to verify the assump-tions underlying the analysis. As ANCOVA is a combination of ANOVA andRegression, the assumptions are similar. Both goals of ANCOVA have similarassumptions:

• The response variable Y is continuous (interval or ratio scaled)

• The data are collected under a completely randomized design. 1 This im-plies that the treatment must be randomized completely over the entire setof experimental units if an experimental study, or units must be selectedat random from the relevant populations if an observational study.

• There must be no outliers. Plot Y vs X for each group separately to seeif there are any points that don’t appear to follow the straight line.

• The relationship between Y and X must be linear for each group. 2

Check this assumption by looking at the individual plots of Y vs X foreach group.

• The variance must be equal for both groups around their respective regres-sion lines. Check that the spread of the points is equal around the rangeof X and that the spread is comparable between the two groups. Thiscan be formally checked by looking at the MSE from a separate regressionline for each group as MSE estimates the variance of the data around theregression line.

• The residuals must be normally distributed around the regression line foreach group. This assumption can be check by examining the residual plotsfrom the fitted model for evidence of non-normality. For large samples, thisis not too crucial; for small sample sizes, you will likely have inadequatepower to detect anything but gross departures.

101.3 Comparing individual regression lines

You saw in earlier chapters, that a statistical model is a powerful shorthandto describe what analysis is fit to a set of data. The model must describe thetreatment structure, the experimental unit structure, and the randomizationstructure.. Let Y be the response variable; X be the continuous X-variable, andGroup be the group factor.

1It is possible to relax this assumption - this is beyond the scope of this course.2It is possible to relax this assumption as well, but is again, beyond the scope of this course.



In all cases that follow, we are assuming that a completely randomized designwas used for the randomization structure. This implies that there are no explicitterms for the randomization structure in the model.

Similarly, there is a single size of experimental unit with no blocking or sub-sampling occurring. This also implies there will be no terms in the model forthe experimental unit structure. In more advanced courses, the analyses in thischapter can be extended to more complex designs.

In earlier chapters, we saw that the model for a single-factor completelyrandomized design is

Y = Group

This is read as saying that variation in Y can be partially explained by an overallgrand mean (never specified) with differences in the mean caused by Groups plusan implicit random noise (which is never specified).

Again, from an earlier chapter, we say that the model for a regression of Yon X is

Y = X

This is read as saying that the variation in Y can be partially explained by anintercept (never specified) plus changes in the X plus an implicit random noise(which is never specified).

As ANCOVA is a combination of the above two analyses, it will not besurprising that the models will have terms corresponding to both Group andX. Again, there are three cases:

If the lines for each group are not parallel:



the appropriate model is

Y 1 = Group X Group ∗X

The terms can be in any order. This is read as variation in Y can be explaineda common intercept (never specified) followed by group effects (different inter-cepts), a common slope on X, and an “interaction” between Group and X whichis interpreted as different slopes for each group. This model is almost equivalentto fitting a separate regression line for each group. The only advantage to usingthis joint model for all groups is similar to that enjoyed by using ANOVA - allof the groups contribute to a better estimate of residual error. If the numberof data points per group is small, this can lead to improvements in precisioncompared to fitting each group individually.

If the lines are parallel across groups, but not coincident:



the appropriate model isY 2 = Group X

The terms can be in any order. The only difference between this and the previousmodel is that this simpler model lacks the Group*X “interaction” term. It wouldnot be surprising then that a statistical test to see if this simpler model is tenablewould correspond to examining the p-value of the test on the Group*X term fromthe complex model. This is exactly analogous to testing for interaction effectsbetween factors in a two-factor ANOVA.

Lastly, if the lines are co-incident:



the appropriate model isY 3 = X

. Now the difference between this model and the previous model is the Groupterm that has been dropped. Again, it would not be surprising that this corre-sponds to the test of the Group effect in the formal statistical test. The test forco-incident lines should only be done if there is insufficient evidence against thehypothesis of parallelism.

While it is possible to test for a non-zero slope, this is rarely done.

101.4 Comparing Means after covariate adjust-ments

to be added later



101.5 Power and sample size

to be added later- use the MSE as the estimate of variance for testing MEANS and for testing

the slope.

101.6 Example - Degradation of dioxin

An unfortunate byproduct of pulp-and-paper production used to be dioxins -a very hazardous material. This material was discharged into waterways withthe pulp-and-paper effluent where it bioaccumulated in living organisms such acrabs. Newer processes have eliminated this by product, but the dioxins in theorganisms takes a long time to degrade.

Government environmental protection agencies take samples of crabs fromaffected areas each year and measure the amount of dioxins in the tissue. Thefollowing example is based on a real study.

Each year, four crabs are captured from two monitoring stations which aresituated quite a distance apart on the same inlet where the pulp mill was lo-cated.. The liver is excised and the livers from all four crabs are compositedtogether into a single sample. 3 The dioxins levels in this composite sample ismeasured. As there are many different forms of dioxins with different toxicities,a summary measure, called the Total Equivalent Dose (TEQ) is computed fromthe sample.

As seen in the chapter on regression, the appropriate response variable islog(TEQ).

Is the rate of decline the same for both sites? Did the sites have the sameinitial concentration?

Here are the raw data which are also available in the dataset dioxin2.jmp inthe Sample Program Library at SampleProgramLibrary.

3Compositing is a common analytical tool. There is little loss of useful information inducedby the compositing process - the only loss of information is the among individual-samplevariability which can be used to determine the optimal allocation between samples withinyears and the number of years to monitor.


SampleProgramLibrary


Site Year TEQ log(TEQ)a 1990 179.05 5.19a 1991 82.39 4.41a 1992 130.18 4.87a 1993 97.06 4.58a 1994 49.34 3.90a 1995 57.05 4.04a 1996 57.41 4.05a 1997 29.94 3.40a 1998 48.48 3.88a 1999 49.67 3.91a 2000 34.25 3.53a 2001 59.28 4.08a 2002 34.92 3.55a 2003 28.16 3.34b 1990 93.07 4.53b 1991 105.23 4.66b 1992 188.13 5.24b 1993 133.81 4.90b 1994 69.17 4.24b 1995 150.52 5.01b 1996 95.47 4.56b 1997 146.80 4.99b 1998 85.83 4.45b 1999 67.72 4.22b 2000 42.44 3.75b 2001 53.88 3.99b 2002 81.11 4.40b 2003 70.88 4.26

The data can be entered into JMP in the usual fashion. Make sure that Siteis a nominal scale variable, and that Year is a continuous variable.

In cases with multiple groups, it is often helpful to use a different plottingsymbol for each group. This is easily accomplished in JMP by selecting the rows(say for site a) and using the Rows->Markers to set the plotting symbol for theselected rows:



The final data sheet has two different plotting symbols for the two sites:



Before fitting the various models, begin with an exploratory examination ofthe data looking for outliers and checking the assumptions.

Each year’s data is independent of other year’s data as a different set ofcrabs was selected. Similarly, the data from one site are independent from theother site. This is an observational study, so the question arises of how exactlywere the crabs were selected? In this study, crab pots were placed on the floorof the sea to capture the available crabs in the area.



Use the Analyze->Fit Y-by-X platform and specify the log(TEQ) as the Yvariable, and Y ear as the X variable:

Then specify a grouping variable by clicking on the pop-down menu near theBivariate Fit title line:



and selecting Site as the grouping variable:



Now select the Fit Line from the same pop-down menu:



to get separate lines fit for each group:



This relationships for each site appear to be linear. The actual estimates arealso presented:



The scatterplot doesn’t show any obvious outliers. The estimated slope forthe a site is -.107 (se .02) while the estimated slope for the b site is -.06 (se .02).The 95% confidence intervals (not shown on the output but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably sothe slopes could be the same for the two groups.

The MSE from site a is .10 and the MSE from site b is .12. This correspondsto standard deviations of

√.10 = .32 and

√.12 = .35 which are very similar so

that assumption of equal standard deviations seems reasonable.

The residual plots (not shown) also look reasonable.

The assumptions appear to be satisfied, so let us now fit the various models.

First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used:



The terms can be in any order and correspond to the model described earlier.This gives the following output:



The regression plot is just the same as the plot of the two individual lines seenearlier. What is of interest is the Effect test for the Site*year interaction. Herethe p-value is not very small, so there is no evidence that the lines are notparallel.

We need to refit the model, dropping the interaction term:



which gives the following regression plot:



This shows the fitted parallel lines. The effect tests:

now have a small p-value for the Site effect indicating that the lines are notcoincident, i.e. they are parallel with different intercepts. This would meanthat the rate of decay of the dioxin appears to be equal in both sites, but theinitial concentration appears to be different.

The estimated (common) slope is found in the Parameter Estimates portionof the output:



and has a value of -.083 (se .016). Because the analysis was done on the log-scale,this implies that the dioxin levels changed by a factor of exp(−.083) = .92 fromyear to year, i.e. about a 8% decline each year. The 95% confidence interval forthe slope on the log-scale is from (-.12 -> -.05) which corresponds to a potentialfactor between exp(−.12) = .88 to exp(−.05) = .95 per year, i.e. between a 12%and 5% decline per year. 4

While it is possible to estimate the difference between the parallel lines fromthe Parameter Estimates table, it is easier to look at the section of the outputcorresponding to the Site effects. Here the estimated LSMeans correspond tothe log(TEQ) at the average value of Year - not really of interest. As in previouschapters, the difference in means is often of more interest than the raw meansthemselves. This is found by using the pop-down menu and selecting a LSMeansContrast or Multiple Comparison procedure to give:

4The confidence intervals are found by right clicking/ctrl-clicking in the Parameter Esti-mates table



The estimated difference between the lines (on the log-scale) is estimated to be0.46 (se .13). Because the analysis was done on the log-scale, this correspondsto a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site bhas 1.58 times the dioxin level as site a. Because the slopes are parallel anddeclining, the dioxin levels are falling in both sites, but the 1.58 times ratioremains consistent.

Finally, the actual by Predicted plot (not shown here), the leverage plots(not shown here) and the residual plot



don’t show any evidence of a problem in the fit.

101.7 Change in yearly average temperature withregime shifts

The ANCOVA technique can also be used for trends when there are KNOWNregime shifts in the series. The case when the timing of the shift is unknown ismore difficult and not covered in this course.

For example, consider a time series of annual average temperatures measuredat Tuscaloosa, Alabama from 1901 to 2001. It is well known that shifts intemperature can occur whenever the instrument or location or observer or othercharacteristics of the station change.

The data are available in the JMP datafile tuscaloosa-avg-temp.jmp in theSample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A portion of the raw data is shown below:


http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms



and a time series plot of the data:

shows a shift in the readings in 1939 (thermometer changed), 1957 (stationmoved), and possibly in 1987 (location and thermometer changed).

It turns out that cases where the number of epochs tends to increase withthe number of data points has some serious technical issues with the propertiesof the estimators. See

Lu, Q. and Lund, R.B. (2007).Simple linear regression with multiple level shifts.Canadian Journal of Statistics, 35, 447-458



for details. Basically, if the number of parameters tends to increase with samplesize, this violates one of the assumptions for maximum likelihood estimation.This would lead to estimates which may not even be consistent! For example,suppose that the recording changed every two years. Then the two data pointsshould still be able to estimate the common slope, but this corresponds tothe well known problem with case-control studies where the number of pairsincreases with total sample size. Fortunately, Lu and Lund (2007) showed thatthis violation is not serious.

The analysis proceeds as in the dioxin example with two sites, except thatnow the series is broken into different epochs corresponding to the sets of yearswhen conditions remained stable at the recording site. In this case, this cor-responds to the years 1901-1938 (inclusive); 1940-1956 (inclusive); 1958-1986(inclusive), and 1989-2000 (inclusive). Note that the years 1939, 1957, and1987 are NOT used because the average temperature in these two years is anamalgam of two different recording conditions5.

For example, the data file (around the first regime change) may look like:

Note that year and Avg Temp and both set to have continuous scale; butepoch should have a nominal or ordinal scale.

Model filling proceeds as before by first the model:

AvgTemp = Y ear Epoch Y ear ∗ Epoch

to see if the change in AvgTemp is consistent among Epochs and then fitting themodel:

AvgTemp = Y ear Epoch

to estimate the common trend (after adjusting for shifts among the Epochs.

The Analyze->Fit Model platform is used:5If the exact day of the change were known, it is possible to weight the two epochs in these

years and include the data points.



There is no strong evidence that the slopes are different among the epochs(p=.10) despite the plot showing a potentially differential slope in the 3rd epoch:



The simpler model with common slopes is then fit:



with fitted (common slope) lines:

No further model simplification is possible and there is evident that the commonslope is different from zero:

The estimated change in average temperature is:



i.e. an estimated increase of .033 (SE .006) per year. The 95% confidenceinterval does not cover 0.

The residual plots (against predicted and the order in which the data werecollected):

shows no obvious problems.

Whenever time series data are used, autocorrelation should be investigated.The Durbin-Watson test is applied to the residuals:



with no obvious problem detected.

The leverage plot (against year)



also reveals nothing amiss.

A more sophisticated analysis can be fit using SAS, but isn’t needed. Thesample program and output are available in the Sample Program Library.

101.8 Example - More refined analysis of stream-slope example

In the chapter on paired comparisons, the example of the effect of stream slopewas examined based on:

Isaak, D.J. and Hubert, W.A. (2000). Are trout populations af-fected by reach-scale stream slope. Canadian Journal of Fisheriesand Aquatic Sciences, 57, 468-477.



In that paper, stream slope was (roughly) categorized into high or low slopeclasses and a paired-analysis was performed. In this section, we will use theactual stream slopes to examine the relationship between fish density and streamslope.

Recall that a stream reach is a portion of a stream from 10 to several hundredmetres in length that exhibits consistent slope. The slope influences the generalspeed of the water which exerts a dominant influence on the structure of physicalhabitat in streams. If fish populations are influenced by the structure of physicalhabitat, then the abundance of fish populations may be related to the slope ofthe stream.

Reach-scale stream slope and the structure of associated physical habitatsare thought to affect trout populations, yet previous studies confound the effectof stream slope with other factors that influence trout populations.

Past studies addressing this issue have used sampling designs wherein datawere collected either using repeated samples along a single stream or measuringmany streams distributed across space and time.

Reaches on the same stream will likely have correlated measurements makingthe use of simple statistical tools problematical. [Indeed, if only a single streamis measured on multiple locations, then this is an example of pseudo-replicationand inference is limited to that particular stream.]

Inference from streams spread over time and space is made more difficultby the inter-stream differences and temporal variation in trout populations ifsamples are collected over extended periods of time. This extra variation reducesthe power of any survey to detect effects.

For this reason, a paired approach was taken. A total of twenty-three streamswere sampled from a large watershed. Within each stream, two reaches wereidentified and the actual slope gradient was measured.

In each reach, fish abundance was determined using electro-fishing methodsand the numbers converted to a density per 100 m2 of stream surface.

Table 101.1 presents the (fictitious but based on the above paper) raw data

Estimates of fish density from a paired experimentslope slope density

Stream (%) class (per 100 m2)1 0.7 low 15.01 4.0 high 21.02 2.4 low 11.02 6.0 high 3.1



3 0.7 low 5.93 2.6 high 6.44 1.3 low 12.24 4.0 high 17.65 0.6 low 6.25 4.4 high 7.06 1.3 low 39.86 3.2 high 25.07 2.0 low 6.57 4.2 high 11.28 1.3 low 9.68 4.2 high 17.59 2.0 low 7.39 3.6 high 10.010 0.7 low 11.310 3.5 high 21.011 2.3 low 12.111 6.0 high 12.112 2.5 low 13.212 4.2 high 15.013 2.3 low 5.013 6.0 high 5.014 1.2 low 10.214 2.9 high 6.015 0.7 low 8.515 2.9 high 7.016 1.1 low 5.816 3.0 high 5.017 2.2 low 5.117 5.0 high 5.018 0.7 low 65.418 3.2 high 55.019 0.7 low 13.219 3.0 high 15.020 0.3 low 7.120 3.2 high 12.021 2.3 low 44.821 7.0 high 48.022 1.8 low 16.022 6.0 high 20.023 2.2 low 7.223 6.0 high 10.1



Notice that the density varies considerably among stream but appears to befairly consistent within each stream.

The raw data is available in a JMP datafile called paired-stream.jmp in theSample Programs Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms..

As noted earlier, this is an example of an Analytical Survey. The treatments(low or high slope) cannot be randomized within stream – the randomizationoccurs by selecting streams at random from some larger population of potentialstreams. As noted in the early chapter on Observational Studies, causal infer-ence is limited whenever a randomization of experimental units to treatmentscannot be performed.

Unlike the example presented in other chapters where the slope is divided(arbitrarily) into two class (low and high slope), we will now use the actual slope.A simple regression CANNOT be used because of the non-independence intro-duced by measuring two reaches on the same stream. However, an ANOCOVAwill prove to be useful here.

First, it seem sensible that the response to stream slope will will be multi-plicative rather than additive, i.e. an increase in the stream slope will changethe fish density by a common fraction, rather than simply changing the densityby a fixed amount. For example, it may turn out that a 1 unit change in theslope, reduces density by 10% - if the density before the change was 100 fish/m2,then after the change, the new density will be 90 fish/m2. Similarly, if the orig-inal density was only 10 fish/m2, then the final density will be 9 fish/m2. Inboth cases, the reduction is a fixed fraction, and NOT the same fixed amount(a change of 10 vs 1).

Create the log(density) column in the usual fashion (not illustrated here). Incases like this, the natural logarithm is preferred because the resulting estimateshave a very nice simple interpretation. 6

An appropriate model will be one where each stream has a separate intercept(corresponding to the different productivities of each stream - acting like ablock), with a common slope for all streams. The simplified model syntaxwould look like

log(density) = stream slope

where the term stream represents a nominal scaled variable and gives the dif-ferent intercepts and the term slope is the effect of the common slope on thelog(density).

6The jmp dataset also created a different plotting symbol for each stream using theRows− > Color or Mark by Column menu.





This is fit using the Analyze->Fit Model platform as:

Note that it stream must have a nominal scale and that slope must have acontinuous scale. The order of the terms in the effects box is not important.

The output from the Analyze->Fit Model platform is voluminous, but acareful reading reveals several interesting features.

First is a plot of the common slope fit to each stream:



This shows a gradual increase as slope increases. This plot is hard to interpret,but a plot of observed vs predicted values is clearer:



Generally, the observed are close to the predicted values except for two potentialoutliers. By clicking on these points, it is shown that both points belong tostream 2 where it appears that the slope increases causes a large decrease indensity contrary to the general pattern seen in the the other streams.

The effect tests:

fail to detect any influence of slope. Indeed the estimated coefficient associatedwith a change in slope is found to be:



is estimated to be .025 (se .0299) which is not statistically significant. 7

Residual plots also show the odd behavior of stream 2:

If this rogue stream is “eliminated” from the analysis, the the resulting plotsdo not show any problems (try it), but now the results are statistically significant(p=.035):

The estimated change in log-density per percentage point change in the slopeis found to be:

7Because the natural log transform was used and the data on the log scale was used,“smallish” slope coefficients have an approximate interpretation. In this example, a slope of.025 on the (natural) log scale implies that the estimated fish density INCREASES by 2.5%every time the slope increases by one percentage point.



i.e. the slope is .05 (se .02) which is interpreted that a percentage point increasein stream slope increases fish density by 5%. 8

The remaining residual plot and leverage plots show no problems.

Yet another alternate analysis!

Because the treatment only has two levels, the same answers can also beobtained by estimating the ratio of the change in the log(density) to the changein slope. 9 To begin, we need to split the data table so that both the log(density)and the slope are in separate columns:

8This easy interpretation occurs because the natural log transform was used. If the common(base 10) log transform was used, there is no longer such a simple interpretation.

9If the slope-class had three or more levels, this analysis could not be done, and the previousanalysis would the preferred route



This creates a data table with separate columns for the log(density) and thestream slope for both the high and low slope categories:

Now create two new variables (create new columns and write a formula for eachcolumn) representing the differences in the log(density) and slope between thehigh and low slope classes:

Finally, we wish to fit a line through the origin through these data points.We use the Analyze->Fit Y-by-X platform,



the Fit Special from the red-triangle drop down menu:



and then check the Constrain intercept



This give the following output:



We obtain the same estimated effect and se . The outlier from stream 2 isreadily evident. When this outlier is excluded and the analysis is repeated,again a statistically significant result is obtained that matches the previousanalysis.

101.9 Comparing Fulton’s Condition Factor K

Not all fish within a lake are identical. How can a single summary measure bedeveloped to represent the condition of fish within a lake?



In general, the the relationship between fish weight and length follows apower law:

W = aLb

where W is the observed weight; L is the observed length, and a and b arecoefficients relating length to weight. The usual assumption is that heavier fishof a given length are in better condition than than lighter fish. Condition indicesare a popular summary measure of the condition of the population.

There are at least eight different measures of condition which can be foundby a simple literature search. Conne (1989) raises some important questionsabout the use of a single index to represent the two-dimensional weight-lengthrelationship.

One common measure is Fulton’s10 K:

K =Weigt

(Length/100)3

This index makes an implicit assumption of isometric growth, i.e. as the fishgrows, its body proportions and specific gravity do not change.

How can K be computed from a sample of fish, and how can K be comparedamong different subset of fish from the same lake or across lakes?

The B.C. Ministry of Environment takes regular samples of rainbow troutusing a floating and a sinking net. For each fish captured, the weight (g), length(mm), sex, and maturity of the fish was recorded.

The data are available in the rainbow-condition.jmp data file in the SampleProgram Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A portion of the raw data data appears below:10There is some doubt about the first authorship of this condition factor. See Nash, R.

D. M., Valencia, A. H. and Geffen, A. J. (2005). The Origin of Fulton’s Condition Factor –Setting the Record Straight. Fisheries, 31, 236-238.





K was computed for each individual fish, and the resulting histogram isdisplayed below:

There is a range of condition numbers among the individual fish with an average(among the fish caught) K of about 13.6.

Deriving a single summary measure to represent the entire population of fishin the lake depends heavily on the sampling design used to capture fish.

Some case must be taken to ensure that the fish collected are a simple randomsample from the fish in the population. If a net of a single mesh size are used,then this has a selectivity curve and the nets are typically more selective for fishof a certain size. In this experiment, several different mesh sizes were used totry and ensure that all fish of all sizes have an equal chance of being selected.

As well, if regression methods have an advantage in that a simple randomsample from the population is no longer required to estimate the regression co-efficients. As an analogy, suppose you are interested in the relationship betweenyield of plants and soil fertility. Such a study could be conducted by finding



a random sample of soil plots, but this may lead to many plots with similarfertility and only a few plots with fertility at the tails of the relationship. Analternate scheme is to deliberately seek out soil plots with a range of fertilitiesor to purposely modify the fertility of soil plots by adding fertilizer, and thenfit a regression curve to these selected data points.

Fulton’s index is often re-expressed for regression purposes as:

W = K

(L

100

)3

This looks like a simple regression between W and(

L100

)3 but with no intercept.

A plot of these two variables:

shows a tight relationship among fish but with possible increasing variance withlength.

There is some debate about the proper way to estimate the regression co-efficient K. Classical regression methods (least squares) implicitly implies thatall of the “error” in the regression is in the vertical direction, i.e. conditions



on the observed lengths. However, the structural relationship between weightand length likely is violated in both variables. This would lead to the error-in-variables problem in regression, which has a long history. Fortunately, therelationship between the two variables is often sufficiently tight that it reallydoesn’t matter which method is used to find the estimates.

JMP can be used to fit the regression line constraining the intercept to bezero by using the Fit Special option under the red-triangle:



This gives rise to the fitted line and statistics about the fit:



Note that R2 really doesn’t make sense in cases where the regression is forcedthrough the origin because the null model to which it is being compared is the



line Y = 0 which is silly.11 For this reason, JMP does not report a value of R2.

The estimated value of K is 13.72 (SE .099).

The residual plot:

shows clear evidence of increasing variation with the length variable. This usu-ally implies that a weighted regression is needed with weights proportional tothe 1/length2 variable. In this case, such a regression gives essentially the sameestimate of the condition factor (K̂ = 13.67, SE = .11).

Comparing condition factors

This dataset has a number of sub-groups – do all of the subgroups have thesame condition factor? For example, suppose we wish to compare the K valuefor immature and mature fish. As noted by Garcia-Berthou (2001)12, this isbest done through a technique called Analysis of Covariance (ANCOVA). Somedetails on ANCOVA are presented in a separate chapter of these notes.

As outlined in the ANCOVA chapter, we start with a model that has aseparate K for each maturity class. The simplified syntax for this model is:

W = (Len/100)3 (Len/100)3 ∗Maturity

11 Consult any of the standard references on regression such as Draper and Smith for moredetails.

12 Garcia-Berthou E. (2001). On the misuse of residuals in ecology: testing regressionresiduals vs. the analysis of covariance. Journal of Animal Ecology 70, 708-711. http://dx.doi.org/10.1046/j.1365-2656.2001.00524.x


http://dx.doi.org/10.1046/j.1365-2656.2001.00524.x

http://dx.doi.org/10.1046/j.1365-2656.2001.00524.x


Note that unlike traditional ANOCOVA models, the model is lacking the sim-ple effect of maturity. The reason for this is that unlike traditional ANCOVAmodels, the intermediate model with parallel slopes really doesn’t make sensewhen the regression lines are forced through the origin. This syntax specifiesthat variation in length are attributable to variations in length and an interac-tion between the two variables. This latter term represents the differential Kbetween the maturity classes.

Here is where some care must be taken. By default. JMP “centers” (i.e. sub-tracts the mean) continuous X variables when they participate in an interactionor similar term:

Hence, if you just try and implement this above model directly in JMP, you willactually fit the model:

W = (Len/100)3 (Len/100)3 − Len/100)3) ∗Maturity

which, when expanded, actually adds an intercept term to the model. Ordi-narily, in regression models with intercepts, this would NOT be a problem – itis because the model is being forced through the intercept that this causes aproblem.

In order to prevent JMP from “centering” the length variable when fittingthese ANCOVA models, turn off the centering option (by unchecking the option)when the model is fit using the Analyze->Fit Model platform of JMP:



Note the use of the No Intercept option to again force the line through theorigin. JMP will ‘complain’ about the odd form of the model because it ismissing the simple maturity class effect, but just ignore the complaints. Thisgives the summary output for the effect test of:

The p-value for the last term in the table of 0.027 indicates that there is strongevidence of a different K between the two maturity classes.

The estimates for the separate maturity classes are obtained from the CustomTest option (some knowledge of the design matrix coding for categorical variablesin JMP is needed to know that JMP uses a (1, -1) coding for indicator variableswith 2 classes):



which gives the estimated K for each maturity class.



If you fit a separate regression for the two maturity classes (use the By optionon the fit model box), you will get the two same estimates. The respectivestandard errors will be slightly different because the single model is able topool over all of the data to estimate the standard errors, but separate estimatescannot do any pooling.

The separate fitted lines are shown below:



Similarly, a comparison of K can be made among the three sex classes (M,F, and U) where immature fish cannot be sexed and are given the code U, butmature fish are further subdivided into M and F classes (don’t forget to uncheckthe centering option in the triangle in the upper left corner of the Analyze->FitModel dialogue:

also shows evidence (p=.025) of a differential K among the three sex classes(this is not unexpected), and a contrast can be done to see if there is furtherevidence of a difference between the male and females:



As the p-value is .0074, there is also strong evidence of a differential K betweenthe males and females as well.

A final plot of the three lines is:



Finally, because you have replicate fish at the same body length, it is possibleto a formal lack-of-fit test. The idea behind this test is to compare the variationin data points at the same replicate lengths (pure error) with the deviationsaround the line from the model (model error). If the model fits well, the ratioof these two estimates of residual variance should be comparable:

The p-value for the lack-of-fit test is quite large indicating no evidence of a lackof fit.

This same ANCOVA method can be used to compare the K values acrosslakes or across time within the same lake. If you have a large number of lakeseach measured multiple times, some very interesting models can be fit that arebeyond the scope of these notes – please contact me. Similarly, interest maylie in modeling the K as functions of other lake-specific covariates such as lakesize, productivity, etc. Again, please contact me as this is beyond the scope of



these notes.

Statistical significance is not the same as biological significance! Whilethere was evidence of differential K in this data set, this statistical significancedoes not imply biological importance. I have no idea of the observed differencesin K among these three groups has any meaning biologically.

101.10 Final Notes

Some sections need to be added here on the following topics:

• danger of ANCOVA is there is no overlap in the covariate

• choice between paired t-test, multi-variate test, or ANCOVA in the caseof two time points

-


Documents

Sampling, Regression, Experimental Design and Analysis …people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/ChapterANCOVA.pdf · Sampling, Regression, Experimental Design and Analysis