

Endorsement from Gregory E. Gilbert “The intermediate SAS® user with a solid background in probability theory will benefit most from this book. Of particular help are the first two chapters where the author lays the foundation for the discrete distributions and log-linear models. The novice reader who feels comfortable with the topics of probability and statistical distributions will benefit most from this presentation, while those with more experience will benefit from these chapters as review material. The conversational tone makes Advanced Log-Linear Models Using SAS® a pleasure to read, and the numerous examples makes this newest addition to the SAS® BBU library a useful text for those users who need a reference text and those who need an application-oriented text.”

Gregory E. Gilbert, MSPH Research Associate

Dean's Office, College of Medicine Medical University of South Carolina


Advanced Log-Linear Models Using SAS®

Daniel Zelterman


The correct bibliographic citation for this manual is as follows: Zelterman, Daniel. 2002. Advanced Log-Linear Models Using SAS®. Cary, NC: SAS Institute Inc.

Advanced Log-Linear Models Using SAS®

Copyright © 2002 by SAS Institute Inc., Cary, NC, USA

ISBN 1-59047-080-X

All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987). SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

1st printing, October 2002

SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hardcopy books, visit the SAS Publishing Web site at www.sas.com/pubs or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. IBM® and all other International Business Machines Corporation product or service names are registered trademarks or trademarks of International Business Machines Corporation in the USA and other countries. Oracle® and all other Oracle Corporation product or service names are registered trademarks of Oracle Corporation in the USA and other countries. Other brand and product names are trademarks of their respective companies.


Contents

Preface v

Acknowledgments ix

1 Discrete Distributions 1
   1.1 Introduction 1
   1.2 The Binomial Distribution 2
   1.3 The Poisson Distribution 8
   1.4 The Multinomial Distribution 11
   1.5 Negative Binomial and Negative Multinomial Distributions 12

2 Basic Log-Linear Models and the GENMOD Procedure 19
   2.1 Introduction 19
   2.2 Log-Linear Models for a 2 × 2 Table 19
   2.3 Log-Linear Models in Higher Dimensions 30
   2.4 Residuals for Log-Linear Models 38
   2.5 Tests of Statistical Significance 40
   2.6 The Likelihood Function 45

3 Ordered Categorical Variables 53
   3.1 Introduction 53
   3.2 Log-Linear Models with One Ordered Category 53
   3.3 Two Cross-Classified Ordered Categories 59

4 Non-Rectangular Tables 69
   4.1 Introduction 69
   4.2 Independence in a Triangular Table 69
   4.3 Interactions in a Circular Table 72
   4.4 Bradley-Terry Model for Pairwise Comparisons 79

5 Poisson Regression 85
   5.1 Introduction 85
   5.2 Poisson Regression for Mortality Data 87
   5.3 Poisson Regression with Overdispersion 92

6 Finite Population Size Estimation 101
   6.1 Introduction 101
   6.2 A Small Example 101
   6.3 A Larger Number of Lists 105

7 Truncated Poisson Regression 111
   7.1 Introduction 111
   7.2 Mathematical Background 112
   7.3 Truncated Poisson Models with Covariates 117
   7.4 An Example with Overdispersion 119
   7.5 Diagnostics and Options 121

8 The Hypergeometric Distribution 129
   8.1 Introduction 129
   8.2 Derivation of the Distribution 131
   8.3 Extended Hypergeometric Distribution 136
   8.4 Hypergeometric Regression 139
   8.5 Comparing Several 2 × 2 Tables 144

9 Sample Size Estimation and Power for Log-Linear Models 149
   9.1 Introduction 149
   9.2 Background Theory 149
   9.3 Power for a 2 × 2 Table 154
   9.4 Sample Size for an Interaction 161
   9.5 Power for a Known Sample Size 167

A The Output Delivery System 173

B Programming Statements for Generalized Linear Models 177

C Additional Readings 181

References 183

Index 185


Preface

This book describes applications of log-linear models that use the GENMOD procedure in SAS to solve problems that arise in the statistical analysis of categorical data. Categorical or frequency data comes about when integer counts of individuals classified into a relatively small number of discrete outcomes are described. This is in contrast to continuous measures, for which there is a different set of models, measures of fit, and diagnostics. The versatility and flexibility offered by GENMOD enables this book to describe a variety of models and applications that can sometimes be very different from those described in the SAS documentation for this procedure.

The GENMOD procedure is an implementation of the generalized linear model methodology made popular by McCullagh and Nelder (1989). Generalized linear models, as their name implies, are an extremely wide class of models that include most of the common statistical models in use today. These include models for continuous valued data, but only the models for discrete, categorical data are described in this book.

You should be familiar with basic SAS programming, such as how to use the DATA step and some of the elementary statistical procedures such as FREQ and MEANS. An introduction to SAS in Chapters 1–3 in Cody and Smith (1997) should be more than adequate. You can access lengthier programs from the companion Web site for this book and modify them. Typically, software is best learned by starting with some existing code and modifying it. Documentation can be referred to for specific details after a basic knowledge of the software has already been attained.

Although the subject matter of this book is similar to that of Categorical Data Analysis Using the SAS System, Second Edition by Stokes, Davis, and Koch (2001), the material presented here includes a wider variety of sampling and log-linear models. Stokes et al. provide a more general introduction to the analysis of categorical data. For a useful introduction to logistic regression refer to Logistic Regression Using the SAS System by Allison (1999).

This book is not intended to act as an introductory text to statistics or SAS programming. Rather, the aim of this book is analogous to what Alan Cantor (1997) does for advanced techniques in survival analysis in Extending SAS Survival Analysis Techniques for Medical Research, Second Edition. You should be comfortable with the basic analysis of categorical data and be ready to try something further afield.

A large number of examples are included and are used to motivate the theory and methods as they are introduced. The many examples in this book are principally drawn from medicine, health, and environmental sciences because this is my background. Of course, the methods and programs given here can be applied to other disciplines as well.

You should have taken at least one elementary statistics course and be familiar with some of the basic language and tools. These statistical topics will generally be defined as they are introduced, but to fully appreciate the narrative you should already be comfortable with their use. You’ll need to understand the rules of probability including conditional probability and independence. The mathematical and sample definitions of mean, variance, and correlation are used. You should be familiar with the binomial, Poisson, normal, and chi-squared distributions. The Pearson goodness of fit chi-squared test is featured prominently throughout the text. The use and application of this important method should be familiar as the test for independence of rows and columns in a 2 × 2 table. The language of hypothesis testing includes significance level, critical region, and power. These topics are reviewed in Chapter 9 where the topic of sample size estimation is discussed. A nodding acquaintance with some of the basics of linear regression with normal errors is useful as well. Several more advanced topics in likelihood-based inference are included in Section 2.6 but can be skipped on a first reading.

The first chapter provides a description of several sampling distributions used in the study of categorical data. These begin with the binomial and Poisson sampling distributions. You should be familiar with some of the basic properties of these two important sampling models. Additional features such as sums of independent Poisson and binomial random variables are discussed and will be needed in subsequent chapters. The multinomial and negative multinomial are two multivariate discrete distributions. These multivariate distributions might be unfamiliar to you so their properties are described. Two additional discrete distributions are introduced in Chapters 7 and 8.

Chapter 2 begins with log-linear models and progresses into diagnostics and maximum likelihood estimation. A variety of examples are given of log-linear models along with the relevant GENMOD code and output. All of these are explained in the context of two numerical examples. The diagnostics produced by the OBSTATS option in PROC GENMOD are described in Chapter 2.

Section 2.6 describes the likelihood function. You can omit this section without a loss of continuity. This section provides an additional level of understanding of the mathematics being performed in the GENMOD procedure.

Chapter 3 gives models for ordered categories. The examples provide a case study of how to use the CLASS statement in GENMOD. There are circumstances where the CLASS statement is beneficial and others where it pays for you to build new variables to suit the problem. Chapter 3 contains an example of statistical modeling of a single ordered category and an example containing two cross-classified ordered categorical variables.

A number of non-rectangular tables are studied in Chapter 4. These include data whose organization does not fit the traditional shape of categorical data. Interesting examples among these include a triangular and even a circular table. Some of these make use of the natural ordering of the categories and use the ideas developed in Chapter 3.

Poisson regression is discussed in Chapter 5. This model uses the Poisson distribution to model the behavior of observations similar to the way in which the normal distribution is used in linear regression. A traditional use of this method is illustrated by describing a data set of cancer deaths in Japan. An example of Poisson regression with many covariates is the estimation of species diversity in the Galapagos archipelago. This last example exhibits overdispersion and is used to illustrate the negative binomial distribution.

The estimation of a finite population size is described in Chapter 6 using the log-linear models developed in Chapter 2. These methods are often used to estimate the size of a closed, finite population. The most common applications of these methods appear in mark-recapture wildlife studies, but they are also useful in epidemiologic investigations.

Chapter 7 uses GENMOD to model a truncated Poisson distribution. The truncated Poisson distribution is closely related to the usual Poisson except that the ‘zero’ counts are not observable. A regression model is developed along the lines of the methods described in Chapter 5. An example of this sampling model includes a sample of towns with lottery winners where a portion of the data has not been reported. The truncated Poisson distribution also has applications in estimating the size of a finite population of illegal immigrants in the Netherlands. The truncated Poisson distribution is not an available option in the GENMOD procedure but can be programmed. The GENMOD and macro code is provided.

The hypergeometric distribution is derived in Chapter 8. This distribution has typically been used in modeling 2 × 2 tables and exact tests of significance. The output of the FREQ procedure is explained in the context of this distribution. Additional programs and macros are provided to fit this distribution in GENMOD. In particular, a hypergeometric regression model is introduced in which the log-odds ratio parameter is modeled as a linear function of covariates. This model has uses in case/control studies and an example of this type of data is examined.

Chapter 9 describes the problems of estimating a sample size and power for planning purposes. These methods are placed in the context of log-linear models and represent an important topic that should be of interest to a wide audience. Chapter 9 reviews the statistical language of hypothesis testing such as significance level and power. The non-central chi-squared distribution is also introduced in Chapter 9. The power of the deviance and Pearson chi-squared statistics is expressed in terms of the non-central chi-squared distribution. Two examples and the SAS code are provided to illustrate the process of estimating a sample size for log-linear models.

Appendix A describes some of the Output Delivery System that can be used to capture and control the output of GENMOD and, more generally, other SAS procedures. Appendix B includes a section on the mathematical framework of generalized linear models that makes up the basis of the GENMOD procedure. Two examples and details are given for programming additional new models in GENMOD.

All of the programs and data sets described here are available on the companion Web site for this book at www.sas.com/companionsites. Select the book title to display its Companion Web Site, then select Example Code to display the SAS programs from this book.


Acknowledgments

I am grateful to Dr. John C. Marsh for a useful discussion of the physiology of a stroke and to Dr. Emile Solloum for explaining various diseases of the lung to me. Comments on various sections of this manuscript were provided by Anne Chao, Chuck Davis, and Peter van der Heijden. Assistance was also provided by Roslyn Cameron of the Charles Darwin Research Station on Santa Cruz, in the Galapagos Islands. I am extremely grateful to Chang Yu and Walter Stroup, who provided a thorough, line-by-line critique on early drafts of this work. Any problems that remain are solely my responsibility. Special thanks are due to Judy Whatley, David Scholtzhauer, and everybody else I interacted with at SAS Institute for their help and encouragement throughout. Special mention is due to Lora Delwiche and Susan Slaughter. Their Little SAS Book anticipated many questions I had with SAS. Finally I thank my wife Linda, who endured my hours and kept me on task.

Daniel Zelterman
New Haven, CT

Summer 2002


Chapter 1
Discrete Distributions

1.1 Introduction 1
1.2 The Binomial Distribution 2
1.3 The Poisson Distribution 8
1.4 The Multinomial Distribution 11
1.5 Negative Binomial and Negative Multinomial Distributions 12

1.1 Introduction

Generalized linear models cover a large collection of statistical theories and methods that are applicable to a wide variety of statistical problems. The models include many of the statistical distributions that a data analyst is likely to encounter. These include the normal distribution for continuous measurements as well as the Poisson distribution for discrete counts. Because the emphasis of this book is on discrete count data, only a fraction of the capabilities of the powerful GENMOD procedure are used. The GENMOD procedure is a flexible software implementation of the generalized linear model methodology that enables you to fit commonly encountered statistical models as well as new ones, such as those illustrated in Chapters 7 and 8.

You should know the distinction between generalized linear models and log-linear models. These two similar sounding names refer to different types of mathematical models for the analysis of statistical data. A generalized linear model, as implemented with GENMOD, refers to a model for the distribution of a random variable whose mean can be expressed as a function of a linear function of covariates. The function connecting the mean with the covariates is called the link. Generalized linear models require specifications for the link function, its inverse, the variance, and the likelihood function. Log-linear models are a specific type of generalized linear model for discrete valued data whose log-means are expressible as linear functions of parameters. This discrete distribution is often assumed to be the Poisson distribution. Chapters 7 and 8 show how log-linear models can be extended to distributions other than Poisson and programmed in the GENMOD procedure.

This chapter and Chapter 2 develop the mathematical theory behind generalized linear models so that you can appreciate the models that are fit by GENMOD. Some of this material, such as the binomial distribution and Pearson’s chi-squared statistic, should already be familiar to those of you who have taken an elementary statistics course, but it is included here for completeness.

This chapter introduces several important probability models for discrete valued data. Some of these models should be familiar to you and only the most important features are emphasized for the binomial and Poisson distributions. The multinomial and negative multinomial distributions are multivariate distributions that are probably unfamiliar to most of you. They are discussed in Sections 1.4 and 1.5.

All of the discrete distributions presented in this chapter are closely related. Each is either a limit or a generalization of another. Some can be obtained as conditional or special cases of another. A unifying feature of all of these distributions is that when their means are modeled using log-linear models, then their fitted means will coincide. Specifically, all of these distributions can easily be fit using the GENMOD procedure.

A brief review of the Pearson chi-squared statistic is given in Section 2.2. More advanced topics, such as likelihood based inference, are also included in Chapter 2 to provide a better appreciation of the GENMOD output. The statistical theory behind the likelihood function of Section 2.6 is applicable to continuous as well as discrete data, but only the discrete applications are emphasized.

Two additional discrete distributions are derived in later chapters. Chapter 7 derives a truncated Poisson distribution in which the ‘zero’ frequencies of the usual Poisson distribution are not recorded. A truncated Poisson regression model is also developed in Chapter 7 and programmed with GENMOD. Two forms of the hypergeometric distribution are derived in Chapter 8 and they are also fitted using GENMOD code provided in that chapter. A more general reference for these and other univariate discrete distributions is Johnson, Kotz, and Kemp (1992).

1.2 The Binomial Distribution

The binomial distribution is one of the most common distributions for discrete or count data. Suppose there are N (N ≥ 1) independent repetitions of the same experiment, each resulting in a binary valued outcome, often referred to as success or failure. Each experiment is called a Bernoulli trial with probability p of success and 1 − p of failure, where the value of parameter p is between zero and one.

Let Y denote the random variable that counts the number of successes following N independent Bernoulli trials. A useful example is to let Y count the number of heads observed in N coin tosses, for which p = 1/2. (An example in which Y is the number of insects killed in a group of size N exposed to a pesticide is discussed as part of Table 1.1 below.) The valid range of values for Y is 0, 1, . . . , N. The random variable Y is said to have the binomial distribution with parameters N and p. The parameter N is sometimes referred to as the sample size or the index of the distribution.

The probability mass function of Y is

    \Pr[Y = j] = \binom{N}{j} p^j (1 - p)^{N - j}

where j = 0, 1, . . . , N. A plot of this function appears in Figure 1.1 where N = 10 and the values of p are .2, .5, and .8.
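This mass function is easy to evaluate directly. The following Python sketch (an illustration of the mathematics, not one of the book's SAS programs) computes Pr[Y = j] and verifies that the probabilities sum to one for the three distributions plotted in Figure 1.1:

```python
# Illustrative sketch (not from the book): the binomial probability mass
# function Pr[Y = j] = C(N, j) * p**j * (1 - p)**(N - j) in pure Python.
from math import comb

def binomial_pmf(j, N, p):
    """Probability of exactly j successes in N Bernoulli(p) trials."""
    return comb(N, j) * p**j * (1 - p)**(N - j)

N = 10
for p in (0.2, 0.5, 0.8):
    probs = [binomial_pmf(j, N, p) for j in range(N + 1)]
    # The probabilities over j = 0, ..., N must sum to one.
    assert abs(sum(probs) - 1.0) < 1e-12
```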

The binomial coefficients are defined by

    \binom{N}{j} = \frac{N!}{j! \, (N - j)!}

with 0! = 1. Read the binomial coefficient as: ‘N choose j’.

The binomial coefficients count the different orders in which the successes and failures could have occurred. For example, in N = 4 tosses of a coin, 2 heads and 2 tails could have appeared as HHTT, TTHH, HTHT, THTH, THHT, or HTTH. These 6 different orderings of the outcomes can also be counted by

    \binom{4}{2} = \frac{4!}{2! \, 2!} = 6 .
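The counting argument can be confirmed by brute force. This short Python check (mine, not from the book) enumerates all sequences of four coin tosses and keeps those with exactly two heads:

```python
# Brute-force check (not from the book) that C(4, 2) = 6 counts the
# orderings of 2 heads and 2 tails in 4 tosses.
from itertools import product
from math import comb

orderings = [seq for seq in product("HT", repeat=4) if seq.count("H") == 2]
assert len(orderings) == comb(4, 2) == 6
```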

The expected number of successes is Np and the variance of the number of successes is Np(1 − p). The variance is smaller than the mean. A symmetric distribution occurs when p = 1/2. When p > 1/2 the binomial distribution has a short right tail and a longer left tail. Similarly, when p < 1/2 this distribution has a longer right tail. These shapes for different values of the p parameter can be seen in Figure 1.1. When N is large and p is not too close to either zero or one, then the binomial distribution can be approximated by the normal distribution.
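The moment formulas can be verified numerically from the mass function. The sketch below (illustrative Python, not SAS) computes E[Y] and Var[Y] directly for N = 10 and p = .2, confirming that the variance is smaller than the mean:

```python
# Sketch (mine): compute the mean and variance of Y ~ binomial(N, p)
# directly from the mass function and check E[Y] = Np, Var[Y] = Np(1-p).
from math import comb

def pmf(j, N, p):
    return comb(N, j) * p**j * (1 - p)**(N - j)

N, p = 10, 0.2
mean = sum(j * pmf(j, N, p) for j in range(N + 1))
var = sum((j - mean) ** 2 * pmf(j, N, p) for j in range(N + 1))
assert abs(mean - N * p) < 1e-9            # E[Y] = Np = 2.0
assert abs(var - N * p * (1 - p)) < 1e-9   # Var[Y] = Np(1-p) = 1.6
assert var < mean                          # variance smaller than the mean
```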

Figure 1.1  The binomial distribution of Y where N = 10 and the values of parameter p are .2, .5, and .8. [Three panels plot Pr[Y] against Y = 0, 2, . . . , 10, one panel for each value of p; the vertical axis runs from 0 to .4.]

One useful feature of the binomial distribution relates to sums of independent binomial counts. Let X and Y denote independent binomial counts with parameters (N1, p1) and (N2, p2) respectively. Then the sum X + Y also behaves as binomial with parameters N1 + N2 and p1 only if p1 = p2. This makes sense if one thinks of performing the same Bernoulli trial N1 + N2 times.
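This additivity can be checked numerically: the convolution of two binomial mass functions with a common p must reproduce the binomial(N1 + N2, p) mass function. A small Python sketch (mine, with arbitrarily chosen N1 = 4, N2 = 6, and p = .3):

```python
# Numerical illustration (not from the book): for independent X ~ bin(N1, p)
# and Y ~ bin(N2, p) with the SAME p, the convolution of the two mass
# functions equals the binomial(N1 + N2, p) mass function.
from math import comb

def pmf(j, N, p):
    return comb(N, j) * p**j * (1 - p)**(N - j)

N1, N2, p = 4, 6, 0.3
for k in range(N1 + N2 + 1):
    conv = sum(pmf(i, N1, p) * pmf(k - i, N2, p)
               for i in range(max(0, k - N2), min(N1, k) + 1))
    assert abs(conv - pmf(k, N1 + N2, p)) < 1e-12
```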

This characteristic of the sum of two binomial distributed counts is exploited in Chapter 8 where the hypergeometric distribution is derived. The hypergeometric distribution is that of Y conditional on the sum X + Y. If p1 ≠ p2 then X + Y does not have a binomial distribution or any other simple expression. Section 8.4 discusses the distribution of X + Y when p1 ≠ p2.

The remainder of this section on the binomial distribution contains a brief introduction to logistic regression. Logistic regression is a popular and important method for providing estimates and models for the p parameter in the binomial distribution. A more lengthy discussion of this technique gets away from the GENMOD applications that are the focus of this book. For more details about logistic regression, refer to Allison (1999); Collett (1991); Stokes, Davis, and Koch (2001, chap. 8); and Zelterman (1999, chap. 3).

The following example demonstrates how the binomial distribution is modeled in practice using the GENMOD procedure. Consider the data given in Table 1.1. In this table six binomial counts are given and the problem is to mathematically model the p parameter for each count. Table 1.1 summarizes an experiment in which each of six groups of insects was exposed to a different dose xi of a pesticide. The life or death of each individual insect represents the outcome of an independent, binary-valued (success or failure) Bernoulli trial. The number Ni in the ith group was fixed by the experimenters (i = 1, . . . , 6). The number of insects that died, Yi, in the ith group has a binomial distribution with parameters Ni and pi.

TABLE 1.1  Mortality of Tribolium castaneum beetles at six different concentrations of the insecticide γ-benzene hexachloride. Concentrations are measured in log10(mg/10 cm²) of a 0.1% film. Source: Hewlett and Plackett, 1950.

Concentration xi       1.08   1.16   1.21   1.26   1.31   1.35
Number killed yi         15     24     26     24     29     29
Number in group Ni       50     49     50     50     50     49
Fraction killed        .300   .490   .520   .480   .580   .592
Fitted linear          .350   .427   .475   .523   .572   .610
Fitted logit           .353   .427   .475   .524   .572   .610
Fitted probit          .352   .427   .475   .524   .572   .610

The statistical problem is to model the binomial probability pi of killing an insect in the ith group as a function of the insecticide concentration xi. Intuitively, the pi should increase with xi, but notice that the empirical rates in the ‘Fraction killed’ row of Table 1.1 are not monotonically increasing. A greater fraction are killed at the x = 1.21 pesticide level than at the 1.26 level. There is no strong biological theory to suggest that the model for the binomial probabilities pi is anything other than a monotone function of the dose xi. Beyond the requirement that pi = p(xi) be a monotone function of the dose xi there is no mathematical form that must be followed, although some functions are generally better than others as you will see.

A simple approach is to model the binomial probabilities p(xi) as linear functions of the dose. That is,

    pi = p(xi) = α + β xi

as in the usual model with linear regression. As you will see, there are much better choices than a linear model for describing binomial probabilities.

Program 1.1 fits this linear model for the binomial probabilities with the MODEL statement in the GENMOD procedure:

model y/n=dose / dist=binomial link=identity obstats;

The notation y/n is the way that the index Ni is specified as corresponding to each binomial count Yi. The dist=binomial option specifies the binomial distribution to the GENMOD procedure. The link=identity option produces a linear model of the binomial p parameter. The OBSTATS option prints a number of useful statistics that are more fully described in Chapter 2. Among the statistics produced by OBSTATS are the estimates of the linear fitted pi parameters that are given in Table 1.1. Output 1.1 provides the estimated parameters for the linear model of p. The estimated parameter values with their standard errors are α = −0.6923 (SE = 0.3854) and β = 0.9648 (SE = 0.3128).
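As a quick arithmetic check (mine, not part of the SAS output), substituting these estimates into p(xi) = α + βxi reproduces the ‘Fitted linear’ row of Table 1.1 to roughly three decimal places:

```python
# Check (not SAS output): the reported estimates alpha = -0.6923 and
# beta = 0.9648 reproduce the 'Fitted linear' row of Table 1.1.
alpha, beta = -0.6923, 0.9648
doses = [1.08, 1.16, 1.21, 1.26, 1.31, 1.35]
fitted_linear = [0.350, 0.427, 0.475, 0.523, 0.572, 0.610]
for x, target in zip(doses, fitted_linear):
    assert abs((alpha + beta * x) - target) < 0.0005
```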

This example fits the linear, logistic, and probit models to the insecticide data of Table 1.1. Some of the output from this program is given in Output 1.1. In general, logistic regression should be performed in the LOGISTIC procedure.

Program 1.1

title1 'Beetle mortality and pesticide dose';
data beetle;
   input y n dose;
   label
      y    = 'number killed in group'
      n    = 'number in dose group'
      dose = 'insecticide dose' ;
datalines;
15 50 1.08
24 49 1.16
26 50 1.21
24 50 1.26
29 50 1.31
29 49 1.35
run;

proc print;
run;

title2 'Fit a linear dose effect to the binomial data';
proc genmod;
   model y/n=dose / dist=binomial link=identity obstats;
run;

title2 'Logistic regression';
proc genmod;
   model y/n=dose / dist=binomial obstats;
run;

title2 'Probit regression';
proc genmod;
   model y/n=dose / dist=binomial link=probit obstats;
run;

The following is selected output from Program 1.1.


Output 1.1

Fit a linear dose effect to the binomial data

The GENMOD Procedure

                     Analysis Of Parameter Estimates

                          Standard      Wald 95%
Parameter   DF  Estimate     Error  Confidence Limits  Chi-Square  Pr > ChiSq
Intercept    1   -0.6923    0.3854  -1.4476    0.0630        3.23      0.0724
dose         1    0.9648    0.3128   0.3516    1.5779        9.51      0.0020
Scale        0    1.0000    0.0000   1.0000    1.0000

NOTE: The scale parameter was held fixed.

Logistic regression

The GENMOD Procedure

                     Analysis Of Parameter Estimates

                          Standard      Wald 95%
Parameter   DF  Estimate     Error  Confidence Limits  Chi-Square  Pr > ChiSq
Intercept    1   -4.8098    1.6210  -7.9870   -1.6327        8.80      0.0030
dose         1    3.8930    1.3151   1.3153    6.4706        8.76      0.0031
Scale        0    1.0000    0.0000   1.0000    1.0000

NOTE: The scale parameter was held fixed.

Probit regression

The GENMOD Procedure

                     Analysis Of Parameter Estimates

                          Standard      Wald 95%
Parameter   DF  Estimate     Error  Confidence Limits  Chi-Square  Pr > ChiSq
Intercept    1   -3.0088    1.0054  -4.9793   -1.0383        8.96      0.0028
dose         1    2.4351    0.8158   0.8362    4.0340        8.91      0.0028
Scale        0    1.0000    0.0000   1.0000    1.0000

NOTE: The scale parameter was held fixed.

The α and β parameters of the linear model are fitted by GENMOD using maximum likelihood, a procedure described in more detail in Section 2.6. Maximum likelihood is a more general method for estimating parameters than the method of least squares, which you might already be familiar with from the study of linear regression. Least squares estimation is the same as maximum likelihood for modeling data that follow the normal distribution.
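For readers curious about what the maximum likelihood fit involves, the following Python sketch implements Newton-Raphson iteration for the logistic model by hand (my own illustration of the algorithm, not the code inside GENMOD) and recovers the logistic estimates reported in Output 1.1:

```python
# Sketch (mine, not SAS): Newton-Raphson maximum likelihood for the
# two-parameter logistic model on the grouped beetle data of Table 1.1.
# It should recover the Output 1.1 estimates mu = -4.8098, theta = 3.8930.
from math import exp

y = [15, 24, 26, 24, 29, 29]           # number killed
n = [50, 49, 50, 50, 50, 49]           # group sizes
x = [1.08, 1.16, 1.21, 1.26, 1.31, 1.35]

mu, theta = 0.0, 0.0
for _ in range(50):                    # Newton-Raphson iterations
    p = [1 / (1 + exp(-(mu + theta * xi))) for xi in x]
    # Score vector of the binomial log-likelihood
    g0 = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
    g1 = sum(xi * (yi - ni * pi) for xi, yi, ni, pi in zip(x, y, n, p))
    # Observed information (X'WX) with weights n_i * p_i * (1 - p_i)
    w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    # Newton step: (mu, theta) += inverse(information) @ score
    mu += (h11 * g0 - h01 * g1) / det
    theta += (h00 * g1 - h01 * g0) / det

assert abs(mu - (-4.8098)) < 0.005 and abs(theta - 3.8930) < 0.005
```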


The problem with modeling the binomial probability p as a linear function of the dose x is that for some extreme values of x the probability p(x) might be negative or greater than one. While this poses no difficulty in the present data example, there is no protection offered in another setting, where it might result in substantial computational and interpretive problems. Instead of linear regression, the probability parameter of the binomial distribution is usually modeled using the logit, or logistic, transformation.

The logit is the log-odds of the probability

logit(p) = log{p/(1 − p)} .

(Logs are always taken base e = 2.718 . . . .) The logistic model specifies that the logit is a linear function of the risk factors. In the present example, the logit is a linear function of the pesticide dose

log{p/(1 − p)} = µ + θ x (1.1)

for parameters (µ, θ) to be estimated. When θ is positive, then larger values of x correspond to larger values of the binomial probability parameter p.

Solving for p as a function of x in Equation 1.1 gives the equivalent form

p(x) = exp(µ + θ x)/ {1 + exp(µ + θ x)} .

This logistic function p(x) always lies between zero and one, regardless of the value of x. This is the main advantage of logistic regression over linear regression for the p parameter. The fitted function p(x) for the beetle data is a curved form plotted in Figure 1.2.
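This boundedness, and the fact that the logistic function inverts the logit, can be seen in a few lines of Python (an illustration using the Output 1.1 estimates, not SAS code):

```python
# Sketch (mine): the logistic function inverts the logit and stays strictly
# between zero and one, even well outside the observed dose range.
from math import exp, log

def logit(p):
    return log(p / (1 - p))

def p_of_x(x, mu=-4.8098, theta=3.8930):   # logistic estimates, Output 1.1
    eta = mu + theta * x
    return exp(eta) / (1 + exp(eta))

# Round trip: logit(p(x)) recovers the linear predictor mu + theta*x
assert abs(logit(p_of_x(1.21)) - (-4.8098 + 3.8930 * 1.21)) < 1e-9
# Bounded even far outside the observed doses (roughly 1.08 to 1.35)
assert 0.0 < p_of_x(-10) < p_of_x(10) < 1.0
```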

Figure 1.2: Fitted logistic, probit (dashed line), and linear regression models for the data given in Table 1.1. The • marks indicate the empirical mortality rates at each of the six levels of concentration of the insecticide. [Figure: p(x) plotted against concentration x over roughly 0.5 to 2, with the logit, probit, and linear fits nearly coincident in the range of the data.]

The logistic regression model for p(x) is fitted by GENMOD in Program 1.1 using the statements

proc genmod;
   model y/n = dose / dist=binomial obstats;
run;

Page 20: Discrete Distributions - SAS

8 Advanced Log-Linear Models Using SAS

The GENMOD procedure fits p(x) by estimating the values of parameters µ and θ in Equation 1.1. There is no need to specify the LINK= option here because the logit link function and logistic regression are the default for binomial data in GENMOD. The fitted values of p(x) are given in Table 1.1 and are obtained by GENMOD using maximum likelihood. The estimated parameter values for Equation 1.1 are given in Output 1.1. These are µ = −4.8098 (SE = 1.6210) and θ = 3.8930 (SE = 1.3151).
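Plugging the reported estimates into the logistic curve is easy to do outside of SAS. The sketch below (illustrative Python, using the µ and θ values quoted from Output 1.1) confirms that the fitted curve is monotone increasing in the dose and bounded inside (0, 1):

```python
import math

# Estimates reported in Output 1.1
mu, theta = -4.8098, 3.8930

def p_hat(x):
    # fitted logistic curve, written in the numerically stable 1/(1+exp(-eta)) form
    eta = mu + theta * x
    return 1.0 / (1.0 + math.exp(-eta))

doses = [0.0, 0.5, 1.0, 1.5, 2.0, 5.0]
probs = [p_hat(x) for x in doses]
assert all(0.0 < p < 1.0 for p in probs)   # bounded even outside the data range
assert probs == sorted(probs)              # increasing, since theta > 0
# The dose at which the fitted mortality is 50% is x = -mu/theta
assert abs(p_hat(-mu / theta) - 0.5) < 1e-9
```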

Another popular method for modeling p(x) is called the probit model or, sometimes, probit regression. The probit model assumes that p(x), properly standardized, takes the functional form of the cumulative normal distribution. Specifically, for regression coefficients γ and ξ to be estimated, probit regression is the model

p(x) = ∫_{−∞}^{γ+ξx} φ(t) dt

where φ(·) is the standard normal density function. If ξ is positive, then larger values of x correspond to larger values of p(x).

The probit model is specified in Program 1.1 using link=probit. The fitted values and a portion of the output appear in Output 1.1. The estimated parameter values for the probit model are γ = −3.0088 (SE = 1.0054) and ξ = 2.4351 (SE = 0.8158).

The fitted models for the linear, probit, and logistic models are plotted in Figure 1.2. The empirical rates for each of the six different dose levels are indicated by ‘•’ marks in this figure. All three fitted models are in close agreement and are almost coincident in the range of the data. Beyond the range of the data the linear model can fail to maintain the limits of p between zero and one. The fitted probit and logistic curves are always between zero and one regardless of the values of the dose x.

The probit and logistic models will generally be in close agreement except in the extreme tails of the fitted models. If the logit and probit models are extrapolated beyond the range of this data, then the logit model usually has longer tails than does the probit. That is, the logit will tend to provide larger estimates than the probit model for p(x) when p is much smaller than 1/2. The converse is also true for p > 1/2. Of course, it is impossible to tell from this data which of the logit or probit models is correct in the extreme tails, or whether they are appropriate at all beyond the range of the data. This is a danger of extrapolating beyond the range of the observed data that is common to all statistical methods.
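The longer tails of the logit can be checked directly with the two fitted models reported above (an illustrative Python sketch using the quoted estimates; the normal CDF is built from the error function):

```python
import math

def p_logit(x, mu=-4.8098, theta=3.8930):
    # fitted logistic model from Output 1.1
    return 1.0 / (1.0 + math.exp(-(mu + theta * x)))

def normal_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_probit(x, gamma=-3.0088, xi=2.4351):
    # fitted probit model from Output 1.1
    return normal_cdf(gamma + xi * x)

# Extrapolating below the data (x = 0): the logit tail is heavier, so the
# logistic fit gives the larger estimate of a small p.
assert p_logit(0.0) > p_probit(0.0)
# In the upper tail the converse holds: 1 - p is larger for the logit fit.
assert (1.0 - p_logit(3.0)) > (1.0 - p_probit(3.0))
```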

The LOGISTIC procedure in SAS is specifically designed for performing logistic regression. The statements

proc logistic;
   model y/n = dose / iplots influence;
run;

are the parallel to the logistic GENMOD code in Program 1.1. The LOGISTIC procedure also has options to fit the probit model. In general practice, logistic and probit regressions should be performed in the LOGISTIC procedure because of the large number of specialized diagnostics that LOGISTIC offers through the use of the IPLOTS and INFLUENCE options.

1.3 The Poisson Distribution

Another important discrete distribution is the Poisson distribution. This distribution has several close connections to the binomial distribution discussed in the previous section.


The Poisson distribution with mean parameter λ > 0 has the mass function

P[Y = j] = e^(−λ) λ^j / j!   (1.2)

and is defined for j = 0, 1, . . . .

The mean and variance of the Poisson distribution are both equal to λ. That is, the mean and variance are equal for the Poisson distribution, in contrast to the binomial distribution, for which the variance is smaller than the mean. This feature is discussed again in Section 1.5, where the negative binomial distribution is described. The variance of the negative binomial distribution is larger than its mean.
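The equality of the Poisson mean and variance can be verified numerically from the mass function (an illustrative Python sketch, truncating the infinite sum at a point where the tail mass is negligible):

```python
import math

def poisson_pmf(j, lam):
    # Equation 1.2: Pr[Y = j] = exp(-lam) lam**j / j!
    return math.exp(-lam) * lam**j / math.factorial(j)

lam = 2.0
js = range(0, 101)                       # truncation; the tail beyond 100 is negligible
total = sum(poisson_pmf(j, lam) for j in js)
mean = sum(j * poisson_pmf(j, lam) for j in js)
var = sum((j - mean) ** 2 * poisson_pmf(j, lam) for j in js)

assert abs(total - 1.0) < 1e-12          # a proper probability distribution
assert abs(mean - lam) < 1e-9            # mean equals lambda
assert abs(var - lam) < 1e-9             # variance equals lambda as well
```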

The probability mass function (Equation 1.2) of the Poisson distribution is plotted in Figure 1.3 for values .5, 2, and 8 of the mean parameter λ. For small values of λ, most of the probability mass of the Poisson distribution is concentrated near zero. As λ increases, both the mean and variance increase and the distribution becomes more symmetric. When λ becomes very large, the Poisson distribution can be approximated by the normal distribution.

Figure 1.3: The Poisson distribution where the values for λ are .5, 2, and 8. [Figure: three panels plotting Pr[Y] against Y for λ = .5, 2, and 8.]

Models for Poisson data can be fit in GENMOD using dist=Poisson in the MODEL statement. Examples of modeling Poisson distributed data make up most of the material in Chapters 2 through 6. The Poisson distribution is a good first choice for modeling discrete or count data if little is known about the sampling procedure that gave rise to the observed data. Multidimensional, cross-classified data is often best examined assuming a Poisson distribution for the count in each category. Examples of multidimensional, cross-classified data appear in Sections 2.2 and 2.3.

The most common derivation of the Poisson distribution is from the limit of a binomial distribution. If the binomial index N is very large and p is very small such that the binomial mean, Np, is moderate, then the Poisson distribution with λ = Np is a close approximation to the binomial. As an example of this use of the Poisson model, consider the distribution of the number of lottery winners in a large population. This example is examined in greater detail in Sections 2.6.2 and 7.5. The chance (p) of any one ticket winning the lottery is very small but a large number of lottery tickets (N) are sold. In this case the number of lottery winners in a city should have an approximately Poisson distribution.
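The quality of this limiting approximation is easy to see numerically. The sketch below (illustrative Python with arbitrarily chosen N and p, not data from the text) compares the binomial(N, p) mass function with the Poisson(λ = Np) mass function:

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(j, lam):
    return math.exp(-lam) * lam**j / math.factorial(j)

# Large N, small p, moderate mean Np: Poisson(Np) approximates the binomial
n, p = 10000, 0.0002
lam = n * p                               # = 2.0
max_diff = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(20))
assert max_diff < 1e-3                    # pointwise agreement to about 3 decimals
```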

Another common example of the Poisson distribution is the model for rare diseases in a large population. The probability (p) of any one person contracting the disease is very small but many people (N) are at risk. The result is an approximately Poisson distributed number of cases appearing every year. This reasoning is the justification for the use of the Poisson distribution in the analysis of the cancer data described in Section 5.2.

Methods for fitting models of Poisson distributed data using GENMOD and log-linear models are given in Chapters 2 through 6 and are not described here. Chapter 2 covers most of the technical details for fitting and modeling the mean parameter of the Poisson distribution to data. Chapters 3 through 6 provide many examples and programs. A special form of the Poisson distribution is developed in Chapter 7. In this distribution, only the positive values (that is, 1, 2, . . .) of the Poisson variate are observed. The remainder of this section provides useful properties of the Poisson distribution.

The sum of two independent Poisson counts also has a Poisson distribution. Specifically, if X and Y are independent Poisson counts with respective means λX and λY, then the sum X + Y has a Poisson distribution with mean λX + λY. This feature of the Poisson distribution is useful when combining rates of different processes, such as the rates for two different diseases.
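This additivity can be confirmed by convolving the two mass functions (an illustrative Python check with arbitrary means λX and λY):

```python
import math

def poisson_pmf(j, lam):
    return math.exp(-lam) * lam**j / math.factorial(j)

lam_x, lam_y = 1.3, 2.2
# Convolution: Pr[X + Y = s] = sum over j of Pr[X = j] Pr[Y = s - j]
for s in range(15):
    conv = sum(poisson_pmf(j, lam_x) * poisson_pmf(s - j, lam_y) for j in range(s + 1))
    # agrees with a single Poisson with mean lam_x + lam_y
    assert abs(conv - poisson_pmf(s, lam_x + lam_y)) < 1e-12
```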

In addition to the Poisson distribution being a limit of binomial distributions, there is another close connection between the Poisson and binomial distributions. If X and Y are independent Poisson counts, as above, and the sum X + Y = N is known, then the conditional distribution of Y is binomial with index N and the probability parameter

p = λY / (λX + λY ) .

This connection between the Poisson and binomial distributions can lead to some confusion. It is not always clear whether the sampling distribution represents two independent counts or a single binomial count with a fixed sample size. Does the data provide one degree of freedom or two? The answer depends on which parameters need to be estimated. In most cases the sample size N is either estimated by the sum of counts or is taken as a known, constrained quantity. In either case this constraint represents a loss of a degree of freedom. That is, whenever you are counting degrees of freedom after estimating parameters from the data, treat the data as binomial whether or not the constraint of having exactly N observations was built into the sampling. Log-linear models with an intercept, for example, will obey this constraint.
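The conditional-binomial identity stated above can be verified from the definitions (an illustrative Python check with arbitrary λX and λY, conditioning on a small total):

```python
import math

def poisson_pmf(j, lam):
    return math.exp(-lam) * lam**j / math.factorial(j)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

lam_x, lam_y = 3.0, 5.0
n = 6                                     # condition on the observed total X + Y = 6
p = lam_y / (lam_x + lam_y)               # the probability parameter in the text
p_total = poisson_pmf(n, lam_x + lam_y)   # Pr[X + Y = n], Poisson by additivity
for y in range(n + 1):
    cond = poisson_pmf(n - y, lam_x) * poisson_pmf(y, lam_y) / p_total
    assert abs(cond - binom_pmf(y, n, p)) < 1e-12
```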


1.4 The Multinomial Distribution

The two discrete distributions described so far are both univariate, or one-dimensional. In the previous two sections you saw that independent Poisson or independent binomial distributions are convenient models for discrete valued data. There are also multivariate discrete distributions. Multivariate distributions are useful for modeling correlated counts. Two such multivariate distributions are described below.

Two useful multivariate discrete distributions are the multinomial and the negative multinomial distributions. These two distributions allow for negative and positive dependence among the discrete counts, respectively.

An important and useful feature of these two multivariate discrete distributions is that log-linear models for their means can be fitted easily using GENMOD and are the same as those obtained assuming independent Poisson counts. In other words, the estimated expected counts for these discrete univariate and multivariate distributions can be obtained using dist=Poisson in GENMOD. The interpretation and the variances of these sampling models can be very different, however.

The multinomial distribution is the generalization of the binomial distribution to more than two discrete outcomes. Suppose each of N individuals can be independently classified into one of k (k ≥ 2) distinct, non-overlapping categories with respective probabilities p1, . . . , pk. The non-negative pi (i = 1, . . . , k) sum to one. Of the N individuals so categorized, the probability that n1 fall into the first category, n2 in the second, and so on, is

Pr[n1, . . . , nk | N, p] = N! p1^n1 p2^n2 · · · pk^nk / (n1! n2! · · · nk!)

where n1 + · · · + nk = N. This is the probability mass function of the multinomial distribution.

An example of the multinomial distribution is the frequency of votes for office cast for a group of k candidates among N voters. Each pi represents the probability that any one randomly selected person chooses candidate i. The ith candidate receives ni votes. If more voters choose one candidate, then there will be fewer votes for each of the other candidates. The joint collection of frequencies ni of votes for the candidates are mutually negatively correlated because of the constraint that there are Σ ni = N voters.

The multinomial distribution models counts that are negatively correlated. This is useful when the total sample size is constrained and a large count in one category is associated with smaller counts in all of the other cells. The negative multinomial distribution, described in the following section, is useful when all of the counts are positively correlated. A positive correlation might be useful for the data of Table 1.2, for example, for modeling disease rates in a city where a large number of individuals with one type of cancer would be associated with high rates in all other types as well.

When k = 2 the multinomial distribution is the same as the binomial distribution. Any one multinomial frequency ni behaves marginally as binomial with parameters N and pi. Similarly, each ni has mean N pi and variance N pi(1 − pi). Any pair of multinomial frequencies has a negative correlation:

Corr(ni, nj) = −{pi pj / [(1 − pi)(1 − pj)]}^{1/2}   (1.3)

The constraint that all multinomial frequencies ni sum to N means that one unusually large count causes all other counts to be smaller. A useful feature of the multinomial distribution is that the fitted means in a log-linear model are the same as if you sampled from independent Poisson distributions.
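Equation 1.3 has a nice sanity check: for any pair of cell probabilities the value is negative, and in the k = 2 (binomial) case, where n2 = N − n1 is completely determined by n1, the correlation collapses to exactly −1. An illustrative Python sketch:

```python
import math

def multinomial_corr(p_i, p_j):
    # Equation 1.3: correlation between two multinomial frequencies
    return -math.sqrt(p_i * p_j / ((1 - p_i) * (1 - p_j)))

# General case: every pairwise correlation is negative
assert multinomial_corr(0.2, 0.3) < 0.0
# k = 2 case: p1 + p2 = 1, so n2 = N - n1 and the correlation is exactly -1
assert abs(multinomial_corr(0.4, 0.6) - (-1.0)) < 1e-12
```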

There is a close connection between the Poisson and the multinomial distributions that parallels the relationship between the Poisson and binomial distributions. Let X1, . . . , Xk denote independent Poisson random variables with positive mean parameters λ1, . . . , λk, respectively. The distribution of the counts X1, . . . , Xk conditional on their sum N = Σ Xi is multinomial with parameters N and p1, . . . , pk where

pi = λi/(λ1 + · · · + λk) .

This close connection between the multinomial and Poisson distributions should help explain why the estimated means are the same for both sampling models. The negative multinomial distribution, described next, also shares this property.

A small numerical example is in order. In 1866, Gregor Mendel reported his theory of genetic inheritance and gave the following data to support his claim. Out of 529 garden peas, he observed 126 with dominant color; 271 hybrids; and 132 with recessive color. His theories indicate that these three genotypes should be in the ratio of 1 : 2 : 1. The expected counts corresponding to these genotypes are then 529/4 = 132.25 dominant color; 529/2 = 264.5 hybrids; and 132.25 recessive color.
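The agreement of Mendel's counts with the 1 : 2 : 1 expectation can be quantified with the Pearson chi-squared statistic; the statistic is introduced formally in Section 2.2, so the computation below is an illustrative Python sketch rather than an analysis from the text:

```python
# Observed counts (dominant, hybrid, recessive) and the 1:2:1 expected counts
observed = [126, 271, 132]
expected = [529 / 4, 529 / 2, 529 / 4]       # 132.25, 264.5, 132.25

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

assert sum(observed) == 529
assert abs(chi2 - 0.4556) < 1e-3             # tiny; the 2-df 5% critical value is 5.99
```

The very small chi-squared value shows the observed counts sit close to the hypothesized ratios, whichever of the three sampling models is adopted.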

The sampling distribution is uncertain and several different lines of reasoning can be used to justify various models. In one sampling scenario, Mendel must have examined a large group of peas and this sample was only limited by his time and patience. That is, his total sample size (N) was not constrained. Each of the three counts was independently determined, as was the total sample size. In this case the counts are best described by three independent Poisson distributions.

In a second reasoning for the appropriate sampling distribution, note that it is impossible to directly observe the difference between the dominant color and a hybrid. Instead, these plants must be self-crossed and examined in the following growing season. Specifically, the ‘grandchildren’ of the pure dominant will all express that characteristic but those of the hybrids will exhibit both the dominant and recessive traits. In this sampling scheme, Mendel might have given a great deal of thought to restricting the sample size N to a manageable number. Using this reasoning, a multinomial sampling distribution might be more appropriate, or perhaps the total number of dominant combined with the hybrid peas should be modeled separately as a binomial experiment.

Finally, note that the determination of pure dominant versus hybrid can only be ascertained as a result of the conditions during the following two growing seasons, which will depend on those years’ weather. All of the counts reported may have been greater or smaller, but in any case, would all be positively correlated. In this setting the negative binomial sampling model described in the following section may be the appropriate model for this data.

In each of these three sampling models (independent Poisson, multinomial, or negative multinomial) the expected counts are the same as given above. Test statistics of goodness of fit will also be the same since these are only a function of the observed and expected counts. The interpretations of the variances and correlations of the counts are very different, however.

1.5 Negative Binomial and Negative Multinomial Distributions

The most common derivation of the negative binomial distribution is through the binomial distribution. Consider a sequence of independent, identically distributed, binary valued Bernoulli trials, each resulting in success with probability p and failure with probability 1 − p. An example is a series of coin tosses resulting in heads and tails, as described in Section 1.2. The binomial distribution describes the probability of the number of successes and failures after a fixed number of these experiments have been conducted.

The negative binomial distribution describes the behavior of the number of failures observed before the cth success has occurred for a fixed, positive, integer-valued parameter c. That is, this distribution measures the number of failures observed until c successes have been obtained. Unlike the binomial distribution, the negative binomial distribution does not have a finite range. In particular, if the probability of success p is small, then a very large number of failures will appear before the cth success is obtained. If X is the negative binomial random variable denoting the number of failures before the cth success, then

Pr[X = x] = ( x + c − 1 choose c − 1 ) p^c (1 − p)^x   (1.4)

where x = 0, 1, . . . . The binomial coefficient in Equation 1.4 reflects that c + x total trials are needed and the last of these is the cth success that ends the experiment.

The expected value of X in the negative binomial distribution is

EX = c(1 − p)/p

and the variance of X satisfies

VarX = c(1 − p)/p^2 = E(X)/p .

The most useful feature of this distribution is that the variance is larger than the mean. In contrast, the binomial variance is smaller than its mean, and the Poisson variance is equal to its mean. Another important feature to note when you are contrasting these three distributions is that the binomial distribution has a finite range but the Poisson and negative binomial distributions both have infinite ranges.
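Both moment formulas, and the fact that the variance exceeds the mean, can be checked numerically from the mass function in Equation 1.4 (an illustrative Python sketch with arbitrary c and p, truncating the infinite range where the tail is negligible):

```python
import math

def neg_binom_pmf(x, c, p):
    # Equation 1.4: number of failures before the c-th success
    return math.comb(x + c - 1, c - 1) * p**c * (1 - p) ** x

c, p = 3, 0.4
xs = range(0, 400)                            # truncate the infinite range
total = sum(neg_binom_pmf(x, c, p) for x in xs)
mean = sum(x * neg_binom_pmf(x, c, p) for x in xs)
var = sum((x - mean) ** 2 * neg_binom_pmf(x, c, p) for x in xs)

assert abs(total - 1.0) < 1e-9
assert abs(mean - c * (1 - p) / p) < 1e-6     # EX = c(1-p)/p = 4.5
assert abs(var - mean / p) < 1e-4             # VarX = EX/p = 11.25 > mean
```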

In the more general case of the negative binomial distribution, it is not necessary to restrict the c parameter to integer values. The generalization of Equation 1.4 to any positive valued c parameter is

Pr[X = x] = c(c + 1) · · · (c + x − 1) p^c (1 − p)^x / x!   (1.5)

where x = 1, 2, . . . and Pr[X = 0] = p^c.

The estimation of the c parameter in Equation 1.5 is generally a difficult task and should be avoided if at all possible. Traditional methods such as maximum likelihood either tend to fail to converge to a finite value or tend to produce huge confidence intervals for the estimated value of the variance of X. The likelihood function for log-linear models and related estimation methods are discussed in Section 2.6. GENMOD offers dist=nb in the MODEL statement to fit the negative binomial distribution. This option is used in a data analysis in Section 5.3.

The narrative below suggests a simple method for estimating the c parameter and producing confidence intervals. The negative binomial distribution behaves approximately as the Poisson distribution for large values of c in Equation 1.5. An explanation for the large confidence intervals in estimates of c is that the Poisson distribution often provides an adequate fit for the data. An example of this situation is given in the analysis of the data in Table 1.2.

TABLE 1.2 Cancer deaths in the three largest Ohio cities in 1989. The body sites of the primary tumor are as follows: oral cavity (1); digestive organs and colon (2); lung (3); breast (4); genitals (5); urinary organs (6); other and unspecified sites (7); leukemia (8); and lymphatic tissues (9). Source: National Center for Health Statistics (1992, II, B, pp. 497–8); Waller and Zelterman (1997).

                        Primary cancer site
City          1     2     3     4     5     6     7     8     9
Cleveland    71  1052  1258   440   488   159   523   169   268
Cincinnati   52   786   988   270   337   133   378   107   160
Columbus     41   518   715   190   212    91   254    77   137

Used with permission: International Biometric Society.

The negative binomial distribution is often described as a mixture of Poisson distributions. If the Poisson mean parameter varies between observations, then the resulting distribution will have a larger variance than that of a Poisson distribution with a fixed parameter. More details of the derivation of the negative binomial distribution as a gamma-distributed mixture of Poisson distributions are given by Johnson, Kotz, and Kemp (1992, p. 204).
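The gamma-mixture derivation can be verified numerically: if λ follows a gamma distribution with shape c and scale θ, then mixing Poisson(λ) over λ yields the negative binomial of Equation 1.5 with p = 1/(1 + θ). The sketch below is an illustrative Python check (the parameter values are arbitrary, and the integral is approximated by a midpoint Riemann sum):

```python
import math

def neg_binom_pmf(x, c, p):
    # Equation 1.5, valid for non-integer c; Pr[X = 0] = p**c
    coef = 1.0
    for i in range(x):
        coef *= c + i                     # c(c+1)...(c+x-1)
    return coef * p**c * (1 - p) ** x / math.factorial(x)

def gamma_density(lam, c, scale):
    return lam ** (c - 1) * math.exp(-lam / scale) / (math.gamma(c) * scale**c)

c, theta = 2.5, 1.5
p = 1.0 / (1.0 + theta)                   # success probability implied by the mixture
step, grid_max = 0.005, 60.0
for x in range(6):
    # integral of Poisson(x; lam) * gamma(lam) d lam, by midpoint rule
    mixture = sum(
        math.exp(-lam) * lam**x / math.factorial(x) * gamma_density(lam, c, theta) * step
        for lam in (step * (k + 0.5) for k in range(int(grid_max / step)))
    )
    assert abs(mixture - neg_binom_pmf(x, c, p)) < 1e-4
```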

There are methods for separately modeling the means and the variances of data with GENMOD using the SCALE parameter. The SCALE parameter might be used, for example, to model Poisson data for which the variances are larger than the means. The setting in which variances are larger than what is anticipated by the sampling model is called overdispersion. Fitting a SCALE parameter with GENMOD is one approach to modeling overdispersion. Using the VARIANCE statement in GENMOD is another approach and is illustrated in Chapters 7 and 8.

A useful multivariate generalization of the negative binomial distribution is the negative multinomial distribution. In this multivariate discrete distribution all of the counts are positively correlated. This is a useful feature for settings such as models for longitudinal or spatially correlated data.

An example to illustrate the property of positive correlation is given in Table 1.2. This table gives the number of cancer deaths in the three largest cities in Ohio for the year 1989, listed by primary tumor. If one type of cancer has a high rate within a specified city, then it is likely that other cancer rates are elevated as well within that city. We can assume that the overall disease rates may be higher in one city than another but these rates are not disease specific. That is, the relative frequencies of the various cancer death rates do not vary across cities. The counts of the various cancer deaths between cities are independent but are positively correlated within each city.

Let X = {X1, . . . , Xk} denote a vector of negative multinomial random variables. An example of such a set X is the joint set of cancer frequencies for any single city in Table 1.2. The joint probability of X taking the non-negative integer values x = {x1, . . . , xk} is

Pr[X = x] = c(c + 1) · · · (c + x+ − 1) {c/(c + µ+)}^c ∏_{i=1}^{k} {µi/(c + µ+)}^{xi} / xi!   (1.6)

where x+ = Σ xi. In Equation 1.6, µ+ = Σ µi is used for the sum of the mean parameters µi > 0. Unlike the multinomial distribution, the observed sample size x+ is not constrained.

The expected value of each Xi in Equation 1.6 is µi. When k = 1, the negative multinomial distribution in Equation 1.6 coincides with the negative binomial distribution in Equation 1.5 with parameter value p = c/(c + µ+). The marginal distribution of each Xi in the negative multinomial distribution has a negative binomial distribution. The variance of each negative multinomial count Xi is

VarXi = µi(1 + µi/c) ,

which is larger than the mean.

The correlation between any pair of negative multinomial counts Xi and Xj, where i ≠ j, is

Corr(Xi, Xj) = {µi µj / [(c + µi)(c + µj)]}^{1/2}   (1.7)

These correlations are always positive. Contrast this statement with the correlations between multinomial counts at Equation 1.3, which are always negative. When the parameter c in the negative multinomial distribution becomes large, then the correlations in Equation 1.7 are close to zero. Similarly, for large values of c, the negative multinomial counts Xi behave approximately as independent Poisson observations with respective means µi. Infinite estimates or confidence interval endpoints of the c parameter are indicative of an adequate fit for the independent Poisson model. An example of this setting is given below.

Estimation of the mean parameters µi is not difficult for the negative multinomial distribution in Equation 1.6. Waller and Zelterman (1997) show that the maximum likelihood estimated mean parameters µi for the negative multinomial distribution are the same as those for independent Poisson sampling. In other words, dist=Poisson in the MODEL statement of GENMOD will fit Poisson, multinomial, and negative multinomial mean parameters, and all of these estimates coincide.

A method for estimating the c parameter is motivated by the behavior of the chi-squared goodness of fit statistic. The chi-squared statistic is the readily familiar measure of goodness of fit from any elementary statistics course. A discussion of its use is given at Equation 2.4 in Section 2.2, where it is used in a log-linear model. Chapter 9 describes the use of chi-squared in sample size and power estimation for planning purposes.

The usual chi-squared statistic

χ² = Σ_i (xi − µi)² / µi

will suffer from the overdispersion of the negative binomial distributed counts xi, which tend to have variances that are larger than their means. As a result, the chi-squared statistic will tend to be larger than is anticipated by the corresponding asymptotic distribution.

The approach taken by GENMOD is to use the SCALE=P or PSCALE options to estimate the amount of overdispersion by the ratio of the chi-squared to its df. The numerator of chi-squared represents the empirical variance for the data. (There is also a corresponding SCALE=D or DSCALE option to rescale all variances using the deviance statistic, described at Equation 2.5, Section 2.2.) Section 5.3 examines a data set that exhibits overdispersion and illustrates the SCALE option in GENMOD.

In most settings, the expected value of the chi-squared statistic is equal to its df under the correct model for the means, without overdispersion. If the value of chi-squared is too large relative to its df, that is, if the ratio chi-squared/df is much greater than one, then there is evidence that the empirical variance of the data exceeds the mean given in the denominator of the chi-squared statistic.

Another test statistic for these overdispersed, negative multinomial distributed data, and a measure of the degree of overdispersion, replaces the denominators with their appropriately modeled larger variances. The test statistic for negative multinomial distributed data is

χ²(c) = Σ_i (xi − µi)² / [µi(1 + µi/c)]   (1.8)

where the denominators are replaced by the negative multinomial variances. Varying the values of c in χ²(c) and matching the values of this statistic with the corresponding asymptotic chi-squared distribution provides a simple method for estimating the c parameter. An application of the use of Equation 1.8 appears in Section 7.4. There are similar methods proposed by Williams (1982) and Breslow (1984).

The rest of this section discusses the example in Table 1.2 and demonstrates how to use Equation 1.8 to estimate the overdispersion parameter. Consider the model of independence of rows and columns in Table 1.2. This model specifies that the relative rates for the various cancer deaths are the same for each of the three cities. Let xrs denote the number of cancer deaths of disease site r in city s. The expected counts µrs for xrs are

µrs = xr+ x+s / N   (1.9)

where x+s and xr+ are the row and column sums, respectively, of Table 1.2.

The µrs in Equation 1.9 are the usual estimates of counts for testing the hypothesis of independence of rows and columns. These estimates should be familiar from any elementary statistics course and are discussed in more detail in Section 2.2. The observed value of chi-squared is 26.96 (16 df) and has an asymptotic significance level of p = .0419, which indicates a poor fit, assuming a Poisson model is used for the counts in Table 1.2.


The c parameter can be estimated as follows. The median of a chi-squared random variable with 16 df is 15.34. Solving Equation 1.8 with

χ²(c) = 15.34

for c yields the estimated value of c = 466.9. Solving this equation is not specific to GENMOD and can easily be performed using an iterative program or spreadsheet.

The point estimate of c = 466.9 is that value of the c parameter that equates the test statistic χ²(c) to the median of its asymptotic distribution. The corresponding fitted correlations for the city of Cincinnati are given in Table 1.3. The values in this table combine the expected counts µrs in the correlations of the negative multinomial distribution given at Equation 1.7 and use the estimate c = 466.9. An important property of the negative multinomial distribution is that all of these correlations are positive.
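The iterative solution mentioned above can be sketched in a few lines of Python (illustrative, not the book's SAS code): compute the margins of Table 1.2, evaluate χ²(c) from Equations 1.8 and 1.9, and bisect on c. Since the margins here are recomputed from the printed counts, rounding may move the solution somewhat from the reported 466.9:

```python
# Table 1.2 counts: rows = cities, columns = primary cancer sites 1..9
counts = [
    [71, 1052, 1258, 440, 488, 159, 523, 169, 268],   # Cleveland
    [52, 786, 988, 270, 337, 133, 378, 107, 160],     # Cincinnati
    [41, 518, 715, 190, 212, 91, 254, 77, 137],       # Columbus
]
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]
n = sum(row_tot)

def chi2_c(c):
    # Equation 1.8 with negative multinomial variances mu*(1 + mu/c)
    total = 0.0
    for r in range(3):
        for s in range(9):
            mu = row_tot[r] * col_tot[s] / n          # Equation 1.9
            total += (counts[r][s] - mu) ** 2 / (mu * (1 + mu / c))
    return total

# chi2_c(c) increases with c; solve chi2_c(c) = 15.34 (16-df median) by bisection
lo, hi = 1.0, 1e6
for _ in range(100):
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if chi2_c(mid) < 15.34 else (lo, mid)
c_hat = (lo + hi) / 2.0

assert abs(chi2_c(c_hat) - 15.34) < 1e-6
assert 300 < c_hat < 800       # in the vicinity of the reported estimate 466.9
```

The same bisection, with 15.34 replaced by the tail quantiles used below, produces the confidence interval endpoints for c.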

TABLE 1.3 Estimated correlation matrix of cancer types in Cincinnati using the fitted negative multinomial model with an estimated value of c equal to 466.9.

Disease    1     2     3     4     5     6     7     8     9
   1     1.00  0.25  0.26  0.20  0.21  0.15  0.21  0.14  0.17
   2           1.00  0.65  0.49  0.51  0.36  0.53  0.35  0.42
   3                 1.00  0.51  0.53  0.38  0.55  0.36  0.44
   4                       1.00  0.40  0.29  0.41  0.28  0.33
   5                             1.00  0.30  0.43  0.29  0.34
   6                                   1.00  0.31  0.21  0.25
   7                                         1.00  0.30  0.35
   8                                               1.00  0.24
   9                                                     1.00
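Entries of this correlation matrix can be reproduced directly from Equation 1.7 with the expected counts of Equation 1.9 and the estimate c = 466.9 (an illustrative Python check, not part of the book's SAS programs):

```python
import math

# Table 1.2 counts: rows = cities, columns = primary cancer sites 1..9
counts = {
    "Cleveland": [71, 1052, 1258, 440, 488, 159, 523, 169, 268],
    "Cincinnati": [52, 786, 988, 270, 337, 133, 378, 107, 160],
    "Columbus": [41, 518, 715, 190, 212, 91, 254, 77, 137],
}
site_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(site_totals)
city_total = sum(counts["Cincinnati"])

# Expected counts under independence (Equation 1.9) for Cincinnati
mu = [s * city_total / grand_total for s in site_totals]

c = 466.9                                  # the point estimate reported in the text
def corr(i, j):
    # Equation 1.7 with the fitted means
    return math.sqrt(mu[i] * mu[j] / ((c + mu[i]) * (c + mu[j])))

# Reproduce two entries of Table 1.3 and confirm all correlations are positive
assert round(corr(0, 1), 2) == 0.25        # sites 1 and 2
assert round(corr(1, 2), 2) == 0.65        # sites 2 and 3
assert all(corr(i, j) > 0 for i in range(9) for j in range(i + 1, 9))
```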

A symmetric 90% confidence interval for the 16 df chi-squared distribution is (7.96; 26.30) in the sense that outside this interval there is exactly 5% area in both the upper and lower tails. Separately solving the two equations

χ²(c) = 7.96   and   χ²(c) = 26.30

gives a 90% confidence interval of (125.8; 18,350) for the c parameter. Such wide confidence intervals are to be expected for the c parameter. An intuitive explanation for these wide intervals is that the independent Poisson sampling model (c = +∞) almost holds for this data.

The symmetric 95% confidence interval for a 16 df chi-squared is (6.91; 28.85). Solving for c in the two separate equations

χ²(c) = 6.91   and   χ²(c) = 28.85

yields the open-ended interval (101.1; +∞). This unbounded interval occurs because the value of χ²(c) can never exceed the value of the original χ² = 26.96, regardless of how large c becomes. That is, there is no solution in c to the equation χ²(c) = 28.85. We can interpret the infinite endpoint in the interval to mean that a Poisson model is part of the 95% confidence interval. The 90% confidence interval indicates that the data is adequately explained by a negative multinomial distribution. Another application of Equation 1.8 to estimate overdispersion appears in Section 7.4. A statistical test specifically designed to test Poisson versus negative binomial data is given by Zelterman (1987).

The chi-squared statistic used in the data of Table 1.2 can have two different interpretations: as a test of the correct model for the means of the counts, and also as a test for overdispersion of these counts. The usual application for the chi-squared statistic is to test the correct model for the means modeled at Equation 1.9, specifying that the disease rates of the various cancers are the same across the three large cities. The use of the χ2(c) statistic here is to model the overdispersion or inflated variances of the counts. The chi-squared statistic then has two different roles: testing the correct mean and checking for overdispersion. In the present data it is almost impossible to separate these different functions.

Two additional discrete distributions are introduced in Chapters 7 and 8. The Poisson distribution is, by far, the most popular and important of the sampling models described in this chapter. The first two sections of Chapter 2 show how the Poisson distribution is the basic tool for modeling categorical data with log-linear models.


Chapter 2
Basic Log-Linear Models and the GENMOD Procedure

2.1 Introduction
2.2 Log-Linear Models for a 2 × 2 Table
2.3 Log-Linear Models in Higher Dimensions
2.4 Residuals for Log-Linear Models
2.5 Tests of Statistical Significance
2.6 The Likelihood Function

2.1 Introduction

Log-linear models are a broad class of methods that enable you to describe the effects of and interactions between the various factors in multidimensional categorical data. You might already have some familiarity with these models. Logistic regression is an example of a common log-linear model. This important technique is briefly introduced in Section 1.1.

This chapter covers the basics as well as the more mathematical aspects of log-linear models. Section 2.2 introduces the basics of log-linear models, and a larger example is presented in Section 2.3. Section 2.4 describes different types of residuals. Section 2.5 discusses methods for computing significance levels that are associated with log-linear models. Section 2.6 develops the likelihood function and its properties. Refer to Stokes, Davis, and Koch (2001, chap. 16) or Zelterman (1999, chap. 4) for additional examples and details.

2.2 Log-Linear Models for a 2 × 2 Table

This section begins with an example of a log-linear model and shows how it is programmed in GENMOD. A model of independence in a 2 × 2 table is a simple example of a log-linear model. The data in Table 2.1 summarizes an experiment in which 16 mice were exposed to the fungicide Avadex and 79 were kept separately under usual care. After 85 weeks all 95 mice were examined by a pathologist for tumors in their lungs (Innes et al. 1969; Plackett 1981, p. 23). This example is interesting because, at the .05 level, the relationship between exposure and tumor outcome is just barely statistically significant.

The most common model to begin the study of this data is that of independence of exposure and tumor development. If two events are independent, then their joint probability is equal to the product of their individual probabilities. In this case

Pr[exposure and tumor] = Pr[exposure] × Pr[tumor] = (16/95) × (9/95)    (2.1)


TABLE 2.1 Incidence of tumors in mice exposed to the fungicide Avadex (Innes et al. 1969).

Exposed Control Totals

Mice with tumors      4      5      9
No tumors            12     74     86

Totals 16 79 95

Public domain: Journal of the National Cancer Institute, Oxford University Press.

The expected proportion of exposed mice with tumors is then the exposed proportion (16/95) times the proportion who developed tumors (9/95). Under the model of independence, you multiply this expected proportion by the sample size (95) to estimate the number of the exposed mice expected to develop cancer.

95 × (16/95) × (9/95) = (16 × 9)/95 = 1.5158    (2.2)

This is the familiar form seen in elementary statistics. Specifically, the expected count is the product of the row and column sums divided by the sample size. In Table 2.1 the observed number of exposed mice with tumors is 4. If exposure to the fungicide is independent of tumor development, then one would expect 1.5158 exposed mice to develop cancer, based on the marginal sums of Table 2.1. The full set of expected counts for all mice is given in Table 2.2.

TABLE 2.2 Expected counts m̂ for mice exposed to the fungicide under the model of independence of exposure and tumor development. The data observed appears in Table 2.1.

Exposed Control Totals

Mice with tumors     1.516     7.484      9
No tumors           14.484    71.516     86

Totals 16 79 95

In this example, let nij (i = 1, 2; j = 1, 2) denote the observed count in the (i, j) categorical cell with row sums ni+ and column marginal sums n+j. That is,

n1+ = Σj n1j = n11 + n12 = 4 + 5 = 9

and n2+ = 86. The marginal column sums are n+1 = 16 and n+2 = 79. The total sample size, which is 95 in this example, will be denoted by either N or n++.

The mean of the count nij is denoted by mij. The values of the means are rarely known in advance and usually have to be estimated from the observed data. Under the model of independence the estimate m̂ij of the expected count mij in the (i, j) cell can be written as

m̂ij = ni+ n+j / N.    (2.3)

This is a more general notation of the example given at Equation 2.2. The full set of all four expected counts is given in Table 2.2.
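As a numerical check on Equation 2.3 (an illustrative Python fragment, not one of the book's SAS programs), the expected counts of Table 2.2 can be computed directly from the marginal sums of Table 2.1:

```python
# Expected counts under independence, m-hat_ij = n_i+ * n_+j / N,
# for the Avadex data of Table 2.1.
counts = [[4, 5],     # mice with tumors: exposed, control
          [12, 74]]   # no tumors:        exposed, control

row_sums = [sum(row) for row in counts]           # n_i+ = [9, 86]
col_sums = [sum(col) for col in zip(*counts)]     # n_+j = [16, 79]
N = sum(row_sums)                                 # n_++ = 95

expected = [[r * c / N for c in col_sums] for r in row_sums]
print([[round(m, 3) for m in row] for row in expected])
# [[1.516, 7.484], [14.484, 71.516]]
```

Note that the rows and columns of the computed table sum to the same marginal totals as the observed data, exactly as the discussion of Table 2.2 points out.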


You should understand two points about the expected counts given in Table 2.2. First, all of the marginal sums ni+ and n+j are retained in Tables 2.1 and 2.2. That is, the marginal sums of the estimated counts (m̂ij) coincide with those of the data (nij). So, for example, n+1 = 16 mice were exposed to the fungicide and the sum of the estimated means m̂+1 also equals 16 exposed mice.

Second, any one of the four observed or expected counts determines the other three. Even though there are four counts in this 2 × 2 table, the restrictions on the row and column sums place constraints on the way in which the expected counts can be filled in. If any one observed or expected count is specified in this table, then the corresponding counts in the same row and column are determined by the marginal sums. This recognition that one count determines all the others gives rise to the expression “one degree of freedom” or 1 df.

Program 2.1 uses the GENMOD procedure to fit the log-linear model of independence in three different ways. Output 2.1 gives a portion of the output from Program 2.1. All three GENMOD procedures contain this identical material. The three different approaches do, however, produce some unique values and interpretations of the parameters. The different approaches are compared at Section 2.2.1 below.

Program 2.1

   options linesize=80 center pagesize=54;

   ods trace on;
   ods printer file='c:\Table2-1.ps' ps;
   ods printer select Genmod.ModelFit;
   ods listing select Genmod.ModelFit;

   data;
      input count expos tumor alpha beta;
      label
         tumor = 'mice with tumors'
         expos = 'exposure status'
         alpha = 'row effects'
         beta  = 'column effect';
      datalines;
    4 1 1  1  1
    5 0 1  1 -1
   12 1 0 -1  1
   79 0 0 -1 -1
   run;

   proc genmod;           /* Model using the CLASS statement */
      class tumor expos;
      model count = tumor expos / dist = Poisson obstats;
   run;

   proc genmod;           /* Mimic the CLASS statement */
      model count = tumor expos / dist = Poisson obstats;
   run;

   proc genmod;           /* Row and column effects sum to zero */
      model count = alpha beta / dist = Poisson obstats;
   run;

   ods printer close;


The first two lines of Program 2.1 control the format of the output. The OPTIONS statement specifies the length and width of the printed page with PAGESIZE= and LINESIZE=, respectively. These same options control the size of the text in the Output window in your SAS session. The OPTIONS statement also uses the CENTER option to specify that the output is to be centered (left to right) on each page.

The ODS statements at the beginning and end of Program 2.1 invoke the SAS Output Delivery System, which places the output from the program into a PostScript or HTML file that can be viewed in the Results window of your SAS session. The Results window has an index that enables you to view a specific item within the listing, without having to scroll up or down to find it. All of the output tables in this book, such as Output 2.1, were prepared using the output from the Output Delivery System to create PostScript files. Other options in the Output Delivery System enable you to save specific items of the output in SAS files and then pass these files on to subsequent steps of your SAS program. Appendix A describes some of the Output Delivery System capabilities that are specific to the GENMOD procedure.

Output 2.1

                    The GENMOD Procedure

           Criteria For Assessing Goodness Of Fit

   Criterion              DF       Value     Value/DF

   Deviance                1      4.2674       4.2674
   Scaled Deviance         1      4.2674       4.2674
   Pearson Chi-Square      1      5.4083       5.4083
   Scaled Pearson X2       1      5.4083       5.4083
   Log Likelihood                264.7784

The Pearson Chi-Square values in Output 2.1 are the well-known chi-squared measure of fit.

χ2 = Σij (nij − m̂ij)2 / m̂ij    (2.4)

The Pearson chi-squared statistic should be familiar to you. This statistic tests whether the difference between the observed count and the expected count could have occurred by chance alone. It is also discussed in Chapter 9 with more details about the sample size and the power of a prospective, planned experiment such as this one.

The Pearson chi-squared statistic has a large value when there are large differences between the observed counts nij and their estimated means m̂ij. If the value of the Pearson chi-squared statistic is unusually large relative to its 1 df reference chi-squared distribution, then you should conclude that this difference between the observed and expected counts could not have occurred by chance alone. If this is the case, you should reject the model (or null hypothesis) under which the expected counts were estimated.

In the present example, the large value of chi-squared rejects the null hypothesis that exposure to the fungicide is independent of the development of lung cancer. This statement does not imply that exposure to the fungicide causes cancer, but rather that cancer occurs at different rates among the exposed and unexposed animals. Furthermore, the difference in these rates is unlikely to have occurred by chance alone. The interpretation of these differing rates is a subjective and difficult process that cannot be performed using statistical methods alone.
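To see where the Pearson Chi-Square value of 5.4083 in Output 2.1 comes from, Equation 2.4 can be evaluated directly. This Python fragment is an illustrative check, not part of Program 2.1; the 1 df p-value uses the identity P(χ2 with 1 df > x) = erfc(√(x/2)):

```python
import math

observed = [4, 5, 12, 74]
expected = [144/95, 711/95, 1376/95, 6794/95]   # Table 2.2 as exact fractions

# Pearson chi-squared of Equation 2.4
chi2 = sum((n - m) ** 2 / m for n, m in zip(observed, expected))

# Upper-tail area of a 1 df chi-squared at the observed statistic
p = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 4))   # 5.4083, as in Output 2.1
print(round(p, 3))      # about 0.02, just below the .05 level
```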


A common source of confusion is whether the chi-squared statistic is performing a one- or two-sided test. At first glance it seems that the chi-squared statistic is only rejected at the upper tail of the chi-squared distribution, and consequently you might be tempted to think that this is a one-sided test. Upon closer examination, however, notice how χ2 at Equation 2.4 squares the differences of the observed and expected counts, thereby removing the signs or directions of these differences. A two-sided test is being performed because counts that are either above or below their expected values by the same amount are treated in the same manner.

The Value/DF column in Output 2.1 is the manner in which GENMOD estimates the SCALE parameter for overdispersion. This method is also described in Section 1.5. In most cases it is impossible to tell whether the variance of the sample is larger than the variance that is anticipated by the Poisson distribution or if the model for the means is incorrect. This chapter uses chi-squared statistics to test the model for the mean. Models for data with overdispersion are discussed again in Sections 5.3 and 7.4.

The Deviance statistic in Output 2.1 might be unfamiliar to you. Its functional form is

G2 = 2 Σij nij log(nij / m̂ij).    (2.5)

The output of the FREQ procedure in SAS refers to this statistic as the likelihood ratio chi-square. The connection between the deviance statistic and the likelihood ratio is explained in Section 2.6.

In most settings the deviance should be close in value to the chi-squared statistic unless many of the estimated means m̂ are small. In that case, the chi-squared statistic will generally have a better approximation to the theoretical chi-squared distribution. The exact methods that avoid the issue of this approximation are described in Chapter 8.

The chi-squared and deviance statistics are both used to test the fit of the model of independence of fungicide exposure and the development of lung tumors. They have the same reference chi-squared distribution with 1 df. More details about the derivation and motivation for the deviance statistic are given in Sections 2.5 and 2.6.
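The Deviance value in Output 2.1 can be verified the same way by evaluating Equation 2.5 directly (again an illustrative Python check, not SAS):

```python
import math

observed = [4, 5, 12, 74]
expected = [144/95, 711/95, 1376/95, 6794/95]   # Table 2.2 as exact fractions

# Deviance statistic of Equation 2.5
G2 = 2 * sum(n * math.log(n / m) for n, m in zip(observed, expected))
print(round(G2, 3))   # 4.267, the Deviance in Output 2.1
```

As the text notes, this is close to (but not identical with) the Pearson chi-squared value of 5.4083 for the same model.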

A useful feature of the deviance statistic is its use in nested log-linear models. Two models are said to be nested if one contains a subset of the terms in the other. The difference of the deviance statistics computed on two nested models also behaves approximately as chi-squared, with df equal to the difference of the degrees of freedom for these two models. The Pearson chi-squared statistic does not share this property. This property of deviance in nested models is illustrated with a numerical example in Section 2.5. The test statistics that result from the application of this property of deviance are called the Type 1 and Type 3 analyses in GENMOD. These tests are described in Section 2.5.2.

The Scaled Deviance and Scaled Pearson X2 values are useful in modeling overdispersion. An example of these statistics in practice appears in Section 5.3.

The Log Likelihood value in Output 2.1 is not a test statistic, but rather is related to the Poisson distribution. This quantity is described in Section 2.6.

The OBSTATS option in Program 2.1 produces a useful set of summary statistics given in Output 2.2 under the heading Observation Statistics. They can also be obtained by using the RESIDUALS, PREDICTED, XVARS, and CL options in the GENMOD MODEL statement. These statistics are useful for identifying outliers and assessing the fit of the model at individual categorical cell counts.


Output 2.2

The GENMOD Procedure

Observation Statistics

Observation Resraw Reschi Resdev StResdev StReschi Reslik

1 2.4842105 2.0177574 1.6716586 1.9266748 2.325572 2.0325803

2 -2.484211 -0.908062 -0.966874 -2.476191 -2.325572 -2.34916

3 -2.484211 -0.652741 -0.672876 -2.397307 -2.325572 -2.331303

4 2.4842105 0.2937565 0.2920799 2.3122995 2.325572 2.3253608

Residuals

Predicted

The GENMOD Procedure

Observation Statistics

Observation count Pred Xbeta Std HessWgt

1 4 1.5157895 0.4159364 0.4038376 1.5157895

2 5 7.4842106 2.0127955 0.336516 7.4842106

3 12 14.484211 2.6730591 0.2521936 14.484211

4 74 71.515789 4.2699183 0.1173023 71.515789

X variables

The GENMOD Procedure

Observation Statistics

Observation    mice with tumors    exposure status

     1                 1                  1
     2                 1                  0
     3                 0                  1
     4                 0                  0

Confidence limits

The GENMOD Procedure

Observation Statistics

Observation Lower Upper

1 0.6868972 3.3449225

2 3.8699295 14.474013

3 8.8354217 23.744464

4 56.826915 90.00151


OBSTATS produces three types of residuals and their corresponding standard errors. Residuals are useful in detecting important deviations or lack of fit in models. The three residuals (Resraw, Reschi, Resdev) and the measures of goodness of fit for individual data points are described in Section 2.4 as diagnostics for log-linear models. The standardized Reschi and Resdev residuals are given by StReschi and StResdev, respectively. Standardized residuals are residuals divided by their estimated standard errors to produce comparable residuals, all with unit variances.

The count column in Output 2.2 gives the original counts nij in each of the four cells of the table. The Pred values are the usual maximum likelihood estimates m̂ij given at Equation 2.3 and Table 2.2. In other words, they are the estimated counts expected under the model of independence. More details about the maximum likelihood estimation process that is employed by GENMOD are given in Section 2.6.

The Xbeta values in Output 2.2 are the logs (base e) of the expected counts. That is,

Pred = exp(Xbeta).

The Std values are the estimated standard errors of these Xbeta log means. An approximate 95% confidence interval for each Xbeta is

Xbeta ± 1.96Std.

The value ±1.96 refers to the upper and lower .025 quantiles of the standard normal distribution.

Confidence limits for the estimated means can be obtained by converting back from the log scale in which the Xbeta values are given. The Lower and Upper values are approximate 95% confidence intervals for the estimated means (Pred values). These confidence limits are calculated by GENMOD as

exp{Xbeta ± 1.96 Std}.

The HessWgt values are Hessian weights needed for the estimation of model parameters. For Poisson data, the Hessian weights are also the variances and are the same as the estimated means. These values do not play a large role in statistical inference by themselves. Hessian weights are used in the estimating procedure of generalized linear models by GENMOD. Section 2.6 discusses parameter estimation in more detail.
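For example, the Lower and Upper limits for the first observation in Output 2.2 can be recovered from its Xbeta and Std values. This is an illustrative Python recomputation, not SAS; GENMOD uses the exact .975 normal quantile (about 1.959964), so the last printed digits differ slightly when 1.96 is used:

```python
import math

# Xbeta and Std for observation 1 (exposed mice with tumors) in Output 2.2
xbeta, std = 0.4159364, 0.4038376

# 95% confidence limits for the estimated mean: exp{Xbeta +/- 1.96 Std}
lower = math.exp(xbeta - 1.96 * std)
upper = math.exp(xbeta + 1.96 * std)
print(round(lower, 4), round(upper, 4))   # close to 0.6869 and 3.3449
```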

2.2.1 Parameterization of the Model of Independence

Program 2.1 shows three different methods for fitting and testing the log-linear model of independence with the data in Table 2.1. The first method uses the CLASS statement; the second method mimics the CLASS statement, using indicator variables; and the third method uses identifiability restrictions. All three programs produce the same GENMOD output in Outputs 2.1 and 2.2, but the parameter estimates are all different in numerical value and interpretation.

This section describes three different approaches to the parameterization of this log-linear model of independence. Log-linear models can easily be applied to more complicated examples than Equation 2.1, but this is a good place to start. Log-linear models are a popular and convenient way of describing the means mij of the counts nij. Notice the multiplicative nature of Equations 2.1 and 2.3. Products become additive when the logarithm is taken.

The log-linear model corresponding to Equations 2.1 and 2.3 is

log mij = µ + αi + βj    (2.6)


where µ acts as an intercept, and αi and βj model the log-mean counts in the different rows and columns, respectively. A full discussion of the interpretation of the parameters in Equation 2.6 is given at Equations 2.7 and 2.10 below.

The problem with writing Equation 2.6 in this form is that there are five parameters on the right (one µ, two α’s, and two β’s) but only four observations. Such a model is said to be over parameterized, or not identifiable, or super-saturated. The problem of having more parameters than data can be avoided by placing restrictions on the values of the α’s and β’s.

One possible method (but by no means the only one) to avoid this over parameterization is to set

α2 = 0 and β2 = 0. (2.7)

This is the method used with the CLASS statement in GENMOD or other ANOVA-based SAS regression procedures including MIXED and GLM. The first approach in Program 2.1 uses this method. Output 2.3 contains a portion of the GENMOD output that is unique to this approach. The second tumor (row) and expos (column) effects are set to zero along with their estimated standard errors and confidence intervals. They can therefore be interpreted as reference categories because all rows and columns will be compared to them. The use of a reference category and the CLASS statement is illustrated in Section 3.3. The Estimate column gives the estimated value of each parameter and the Standard Error column gives an estimate of its standard error.

Output 2.3

                          The GENMOD Procedure

                     Analysis Of Parameter Estimates

                            Standard       Wald 95%
   Parameter  DF  Estimate    Error    Confidence Limits  Chi-Square  Pr > ChiSq

   Intercept   1    0.4159   0.4038    -0.3756    1.2074        1.06      0.3030
   tumor 0     1    2.2571   0.3503     1.5705    2.9438       41.51      <.0001
   tumor 1     0    0.0000   0.0000     0.0000    0.0000           .           .
   expos 0     1    1.5969   0.2742     1.0595    2.1342       33.93      <.0001
   expos 1     0    0.0000   0.0000     0.0000    0.0000           .           .
   Scale       0    1.0000   0.0000     1.0000    1.0000

   NOTE: The scale parameter was held fixed.

The values in the Chi-Square column are calculated as

Chi-Square = (Estimate / Standard Error)2    (2.8)

and provide a test of the null hypothesis that the true (population) parameter being estimated is equal to zero. This statistic is sometimes called the Wald chi-squared (in PROC LOGISTIC output, for example) and is illustrated as part of Figure 2.1 in Section 2.6.

The upper and lower Wald 95% Confidence Limits are calculated as

Estimate ± 1.96 Standard Error. (2.9)

The parameter estimates and their standard errors are obtained from the likelihood function, which is explained in Section 2.6.
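Equations 2.8 and 2.9 can be checked against the tumor 0 row of Output 2.3. This is a hypothetical Python recomputation from the printed, rounded values, so the last digit can differ from the SAS output:

```python
# tumor 0 row of Output 2.3 (printed, rounded values)
estimate, se = 2.2571, 0.3503

wald_chi2 = (estimate / se) ** 2      # Equation 2.8
lower = estimate - 1.96 * se          # Equation 2.9
upper = estimate + 1.96 * se

print(round(wald_chi2, 1), round(lower, 4), round(upper, 4))
# about 41.5, with limits near (1.5705, 2.9438)
```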


Under the null hypothesis that the true parameter value is equal to zero, the chi-squared statistic will usually behave as a 1 df chi-squared random variable, and its statistical significance is calculated in the Pr > ChiSq column. These significance levels in Output 2.3 demonstrate a large difference in the column and row frequencies of the data in Table 2.1, but say nothing about the relationship between the fungicide exposure and the subsequent development of lung cancer.

The Intercept is the log (base e) of the expected number of exposed mice with tumors. That is,

log(16 × 9/95) = log(1.5158) = .4159

which is the same as the logarithm of the value obtained at Equation 2.2. The categorical cell for exposed mice with tumors is the reference cell for this parameterization. The estimated standard error of the intercept and the chi-squared test of a zero intercept in Output 2.3 are not useful in the present setting. These statistics for the intercept can be important in different settings, however, and are used in examples presented in Chapter 6.

The tumor and expos parameters model the marginal sums of the data in Table 2.1. These parameters compare the first row and column to the second (reference) row and column, respectively. Specifically,

tumor 0 = log(86/9) = 2.2571

and

expos 0 = log(79/16) = 1.5969

are the estimated parameter values in Output 2.3.

These parameter estimates model the marginal sums of the data but tell nothing about the interaction effect between exposure to the fungicide and the subsequent tumor development. The strength of the relationship between exposure and tumor development is measured with the chi-squared statistics of Output 2.1. As with the intercept, estimated standard errors and chi-squared tests for these parameter values are reported by GENMOD but are not particularly useful in the present example.
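Since the intercept and the two nonzero parameters of Output 2.3 are simple functions of the marginal sums, they can be reproduced directly (an illustrative Python check, not SAS):

```python
import math

# CLASS-statement parameterization: the exposed-with-tumor cell is the
# reference cell, and the second row and column are the reference levels.
intercept = math.log(16 * 9 / 95)   # log of the reference-cell expected count
tumor0 = math.log(86 / 9)           # first row relative to the reference row
expos0 = math.log(79 / 16)          # first column relative to the reference column

print(round(intercept, 4), round(tumor0, 4), round(expos0, 4))
# 0.4159 2.2571 1.5969, as in Output 2.3
```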

In this example you can switch the reference categories to avoid over parameterization. The output for this model, which mimics the CLASS statement model, appears in Output 2.4. In this second parameterization the parameters are restricted according to

α1 = 0 and β1 = 0.

Output 2.4

                          The GENMOD Procedure

                     Analysis Of Parameter Estimates

                            Standard       Wald 95%
   Parameter  DF  Estimate    Error    Confidence Limits  Chi-Square  Pr > ChiSq

   Intercept   1    4.2699   0.1173     4.0400    4.4998     1325.03      <.0001
   tumor       1   -2.2571   0.3503    -2.9438   -1.5705       41.51      <.0001
   expos       1   -1.5969   0.2742    -2.1342   -1.0595       33.93      <.0001
   Scale       0    1.0000   0.0000     1.0000    1.0000

   NOTE: The scale parameter was held fixed.


Compare these restrictions on the parameter values to those given at Equation 2.7. The exposure and tumor parameters have reversed signs, but their standard errors and chi-squared statistics have the same values as in Output 2.3.

The reason that the signs are reversed is the manner in which the CLASS statement builds the indicators for exposure and tumor development. The indicator values of 0 and 1 are sorted, and the 1 appears after the 0. (The ORDER= option in the PROC GENMOD statement can be used to reverse this.) The CLASS indicators set the parameter value of the last category to zero, so the 1 status is the reference category. That is, the regression coefficient for this CLASS variable represents the 0 level and the level 1 coefficient is set to zero. Of course, this statement will only hold for factors that have only two levels and are coded as 0 and 1.

If you were to regress directly on the 0 and 1 indicators, then the 0 category would become the reference category and the regression coefficient would reflect the 1 level. This is how the indicator variables can be used to mimic the CLASS statement in SAS. The Chi-Square and Pr > ChiSq values for the tumor and exposure indicators are the same in Outputs 2.3, 2.4, and 2.5.

Another popular technique for removing the over parameterization in Equation 2.6 is to require that the two α’s and the two β’s both sum to zero. The third approach in Program 2.1 uses this method. In SAS this is often called effect coding. That is,

α1 + α2 = 0 and β1 + β2 = 0. (2.10)

The output from the parameterization given at Equation 2.10 is given in Output 2.5.

Output 2.5

                          The GENMOD Procedure

                     Analysis Of Parameter Estimates

                            Standard       Wald 95%
   Parameter  DF  Estimate    Error    Confidence Limits  Chi-Square  Pr > ChiSq

   Intercept   1    2.3429   0.1974     1.9561    2.7297      140.94      <.0001
   alpha       1   -1.1286   0.1752    -1.4719   -0.7852       41.51      <.0001
   beta        1   -0.7984   0.1371    -1.0671   -0.5298       33.93      <.0001
   Scale       0    1.0000   0.0000     1.0000    1.0000

   NOTE: The scale parameter was held fixed.

If you sum over all four categorical cells in Equation 2.6 and use the coding of Equation 2.10, then you see that the intercept for this log-linear model will satisfy

µ = (1/4) Σij log mij.

When you replace the cell means mij by their estimates m̂ij given at Equation 2.3, then you have the estimate

µ̂ = (1/2)(log n1+ + log n2+ + log n+1 + log n+2) − log N = 2.3429

which is listed as the Intercept in Output 2.5.


Since alpha = α1 = −α2 and beta = β1 = −β2, there is no need to produce estimates for two row and two column parameters. The fitted row parameter is

alpha = 1/2(log n1+ − log n2+) = (1/2) log(9/86) = −1.1286

and the column parameter is

beta = 1/2(log n+1 − log n+2) = (1/2) log(16/79) = −.7984.

As with the other two parameterizations, the row and column effects here are functions of the marginal sums of the data and tell us nothing about the relationship between tumor development and exposure to the fungicide.

The factors of 1/2 in these expressions for alpha and beta explain why the estimates for α and β in Output 2.5 are exactly half of those obtained in Outputs 2.3 and 2.4. The estimated standard errors of α and β are also half of those seen in the previous two outputs, so that the chi-squared values for these row and column effects are identical in all three approaches.
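All three estimates of Output 2.5 follow from the marginal sums under the sum-to-zero coding; a quick Python check (hypothetical illustration, not part of Program 2.1):

```python
import math

n1p, n2p = 9, 86     # row sums n_1+, n_2+
np1, np2 = 16, 79    # column sums n_+1, n_+2
N = 95

# Effect-coding estimates from the marginal sums
mu = 0.5 * sum(math.log(s) for s in (n1p, n2p, np1, np2)) - math.log(N)
alpha = 0.5 * math.log(n1p / n2p)
beta = 0.5 * math.log(np1 / np2)

print(round(mu, 4), round(alpha, 4), round(beta, 4))
# 2.3429 -1.1286 -0.7984, matching Output 2.5
```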

To summarize, there are three different but equivalent parameterizations and ways to fit log-linear models in GENMOD. The CLASS statement can be used directly or its features can be duplicated using binary valued indicator variables. A third approach uses different identifiability constraints on the parameters in Equation 2.6. The model parameters have different estimated values in each of these three methods, but the fitted means m̂ and measures of goodness of fit are the same in every case. This is an important point because one is usually concerned with the fit of the model, which, in turn, is only a function of these fitted means m̂ and not of the values of the parameters (µ, αi, βj).

2.2.2 The Odds Ratio

Equation 2.6 is useful for describing counts when fungicide exposure is independent of the development of lung cancer in mice. How can you model the lack of independence between these two factors if you believe that they are related? In other words, is there a simple measure you can use to describe the dependence of the factors on each other? The measure you are looking for is called the odds ratio and is explained below. Odds ratios fit well into the framework of log-linear models because they are multiplicative in nature.

In Table 2.1 there were 16 mice exposed to the fungicide. Of these 16 mice, 4 developed tumors and 12 did not. The odds of developing cancer are

4/12 = .333

This ratio represents the number of mice in identical conditions that developed cancer over the number of those that did not.

Among the 79 unexposed mice, 5 developed cancer and 74 did not. The odds of developing cancer in these control mice are then

5/74 = .06757.

Notice that these two odds are not equal. Their difference can be attributed to the increased risk of cancer that the exposed mice experience. The best way to compare these two odds is not through their difference, but rather through their ratio.

Specifically, the odds ratio is the ratio of the two odds. In the present example, the odds ratio

(4/12)/(5/74) = 4.933

represents the increased risk associated with the exposed mice over the risk of cancer in their unexposed counterparts.


You can then say that exposed mice are almost 5 times as likely to develop cancer. This single number provides a magnitude and direction for the risk associated with exposure. Odds ratios greater than 1 indicate increased risk and odds ratios less than 1 indicate a decreased risk. Under the model of independence of exposure and tumor development, you would expect to see an odds ratio close to 1 in value.
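The arithmetic of the odds ratio, including the equivalent cross-product form, is easy to verify (a hypothetical Python illustration, not part of the book's SAS programs):

```python
odds_exposed = 4 / 12    # tumors vs. no tumors among exposed mice
odds_control = 5 / 74    # tumors vs. no tumors among control mice

odds_ratio = odds_exposed / odds_control
cross_product = (4 * 74) / (12 * 5)   # diagonal entries multiplied

print(round(odds_ratio, 3), round(cross_product, 3))   # 4.933 4.933
```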

Goodness of fit measures such as chi-squared or deviance do not indicate the direction of the dependence. That is, a large value of either of these statistics indicates a dependence of the factors in a 2 × 2 table but does not tell you whether fungicide exposure increases or decreases the cancer risk.

You can also write the odds ratio in the equivalent expression

(4/12)/(5/74) = (4 × 74)/(12 × 5).

In this expression, notice how the diagonal entries of the original table of frequencies are multiplied. This leads many authors to refer to the odds ratio as the cross-product ratio.

The odds ratio estimated here is an intuitive and simple measure of dependence. There are many other measures of dependence provided in the FREQ procedure. Some of these are discussed in Chapter 8. Section 3.3 develops a log-linear model that accounts for odds ratios that change across several 2 × 2 tables.

It is often convenient to talk about the logarithm of the odds ratio rather than the odds ratio itself. In this example, the log odds ratio is

log(4.9333) = 1.5960.

The log odds ratio in the present example has an approximate standard error of

{1/4 + 1/12 + 1/5 + 1/74}^(1/2) = √.5468 = .7395

You can use this value to create confidence intervals using the approximate normal distribution of the log odds ratio. These intervals are valid under the null hypothesis that the odds ratio of the population is 1 or, equivalently, that the log odds ratio is 0. Such an interval can be used as another test of fit for the model of independence.

In the present example, the log odds ratio has a 95% confidence interval of

1.5960 ± 1.96 × .7395

or (0.147; 3.05).

You can convert this confidence interval for the log odds ratio back to the odds ratio. The corresponding 95% confidence interval for the odds ratio has endpoints

exp(0.147) = 1.158 and exp(3.05) = 21.0.

This interval does not include 1 and provides you with statistical evidence of an association between fungicide exposure and tumor development. Mice exposed to the fungicide are at greater risk of developing lung cancer than those not exposed.
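The whole interval calculation above can be traced in a few lines (an illustrative Python recomputation, not SAS output):

```python
import math

# Log odds ratio and its approximate standard error for Table 2.1
log_or = math.log((4 * 74) / (12 * 5))      # 1.5960
se = math.sqrt(1/4 + 1/12 + 1/5 + 1/74)     # .7395

# 95% confidence interval for the log odds ratio, then for the odds ratio
lo, hi = log_or - 1.96 * se, log_or + 1.96 * se
print(round(lo, 3), round(hi, 3))           # 0.147 3.045
print(round(math.exp(lo), 3), round(math.exp(hi), 1))
# about 1.158 and 21.0 for the odds ratio itself
```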

2.3 Log-Linear Models in Higher Dimensions

The true value of log-linear models is evident when they are applied to problems involving multidimensional data. The previous section presents a simple example of a two-dimensional table of frequencies. It introduces the basic notation and implementation in GENMOD for such tables. This section shows how higher dimensional, multifactor tables can be examined using log-linear models.


Chapter 2 Basic Log-Linear Models and the GENMOD Procedure 31

The data of Table 2.3 is presented by Wermuth (1976) and McCullagh and Nelder (1989, p. 190). This table gives the frequencies of infant vitality in a group of German women who were either pregnant for the first time or had experienced complications in earlier pregnancies. The data is cross-classified by four binary-valued factors: gestational age, maternal age, smoking habit, and infant vitality status (whether or not the infant lives).

TABLE 2.3 Infant vitality frequencies by maternal age, gestational age, and smoking habit. Source: Wermuth (1976); McCullagh and Nelder (1989, p. 190).

Gestational    Mother's      Cigarettes    Perinatal    Live
age (days)     age (years)   smoked/day    mortality    births

197–260        < 30          ≤ 5                50        315
                             6+                  9         40
               30+           ≤ 5                41        147
                             6+                  4         11
261+           < 30          ≤ 5                24       4012
                             6+                  6        459
               30+           ≤ 5                14       1494
                             6+                  1        124

Public domain: Material not copyright protected.

One approach to the analysis of this data is to use logistic regression to model the risk factors for infant vitality. Such an approach is useful if you are only interested in the single outcome of vitality. For example, you might want to know if smoking or maternal age are significant risk factors in determining infant vitality.

On the other hand, the log-linear approach can be used to measure the strength of the simultaneous relationships between all of the risk factors. Log-linear models might reveal a strong relationship between smoking and maternal age, for example, that would not be discovered using the logistic regression approach.

Five different log-linear models are fitted by GENMOD in Program 2.2. The first of these is the model of mutual independence of the four factors: maternal age, gestational age, smoking, and vitality. The second contains all six possible pairs of interactions between these four factors. The third fits all possible three-way interactions between the four factors. This approach of examining models with parameter interactions of each order is a reasonable way to initiate an analysis of multidimensional categorical data. Another example of this approach is given in Section 6.3. The last two models contain a subset of all possible pairwise interactions. In the fifth model every interaction effect included in the model has at least a moderate statistical significance. A more detailed discussion of these log-linear models is given in the remainder of this section.

Portions of the output of this program are given in Outputs 2.6 through 2.10.

Program 2.2 title1 'Perinatal mortality';
data;
   input gest age cigs vita count @@;
   label
      gest = 'Gestational age 261+ days'
      age  = 'Mother over 30 years'
      cigs = 'More than 6 cigarettes/day'
      vita = 'Live births';
   datalines;
0 0 0 0   50   0 0 0 1  315   0 0 1 0    9   0 0 1 1   40
0 1 0 0   41   0 1 0 1  147   0 1 1 0    4   0 1 1 1   11
1 0 0 0   24   1 0 0 1 4012   1 0 1 0    6   1 0 1 1  459
1 1 0 0   14   1 1 0 1 1494   1 1 1 0    1   1 1 1 1  124
run;
title2 'Log-linear model of mutual independence';
proc genmod;            /* Produces Output 2.6 */
   model count = gest age cigs vita / dist = Poisson;
run;
title2 'Log-linear model with all pairwise effects';
proc genmod;            /* Produces Output 2.7 */
   model count = gest | age | cigs | vita @2 / dist = Poisson;
run;
title2 'Log-linear model with all 3-way interactions';
proc genmod;            /* Produces Output 2.8 */
   model count = gest | age | cigs | vita @3 / dist = Poisson;
run;
title2 'A small log-linear model with 5 pairwise interactions';
proc genmod;            /* Produces Output 2.9 */
   model count = gest age cigs vita gest*age age*cigs
                 gest*vita age*vita cigs*vita / dist = Poisson;
run;
title2 'A log-linear model with 4 pairwise interactions';
proc genmod;            /* Produces Output 2.10 */
   model count = gest age cigs vita gest*age age*cigs
                 gest*vita age*vita / dist = Poisson;
run;

The first model considers the mutual independence of the four factors: gestational age g, maternal age a, cigarette habit c, and fetal vitality v. The 0,1-valued indices (g, a, c, v) are coded as follows: g = 0 is low gestational age; a = 0 is low maternal age; c = 0 is low cigarette use; and v = 0 is perinatal death. Similarly, g = 1 is high gestational age, and so on for the other three variables in this data.

The functional form of the log-linear model of mutual independence is

log mgacv = µ + αg + βa + γc + δv. (2.11)

This interpretation is valid because log(mgacv) is a sum, and mgacv is a product, of functions of the four separate factors. The model specifies that the probability of one woman simultaneously experiencing the four factors (g, a, c, v) is

Pr[Gestation = g, Age = a, Cigarettes = c, Vitality = v]
   = Pr[Gestation = g] Pr[Age = a] Pr[Cigarettes = c] Pr[Vitality = v].

All of the MODEL statements in Program 2.2 regress directly on the binary-valued indicator variables (g, a, c, v), so the identifiability constraints in Equation 2.11 are

α0 = β0 = γ0 = δ0 = 0.

There are 16 categorical counts and 5 parameters in this model, leaving 11 df for the chi-squared statistic. The goodness of fit information from the GENMOD output for Equation 2.11 appears in Output 2.6. The model of mutual independence has a poor fit to the data: (χ2 = 808.5; 11 df; p < .0001). Therefore, it is not acceptable as a final model of the data.
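The degrees of freedom quoted for these models follow from simple counting: the 2 × 2 × 2 × 2 table has 16 cells, and the df is 16 minus the number of fitted parameters, with binomial coefficients counting the effects of each order. A brief illustrative sketch (in Python, not part of the book's SAS programs):

```python
from math import comb

cells = 2 ** 4    # 16 counts in the 2 x 2 x 2 x 2 table

# cumulative parameter counts: intercept, main effects, pairwise, three-way
params_indep    = 1 + comb(4, 1)                # 5
params_pairwise = params_indep + comb(4, 2)     # 11
params_3way     = params_pairwise + comb(4, 3)  # 15

print(cells - params_indep)     # 11 df: mutual independence, Equation 2.11
print(cells - params_pairwise)  # 5 df: all pairwise interactions
print(cells - params_3way)      # 1 df: all three-way interactions
```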

Output 2.6 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 11 374.1426 34.0130

Scaled Deviance 11 374.1426 34.0130

Pearson Chi-Square 11 808.4906 73.4991

Scaled Pearson X2 11 808.4906 73.4991

Log Likelihood 43897.3205

Output 2.7 summarizes the fit of the log-linear model containing all six possible pairwise interactions

log mgacv = µ + αg + βa + γc + δv + εga + ζgc + ηgv + θac + κav + λcv. (2.12)

Output 2.7 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 5 1.8419 0.3684

Scaled Deviance 5 1.8419 0.3684

Pearson Chi-Square 5 1.8574 0.3715

Scaled Pearson X2 5 1.8574 0.3715

Log Likelihood 44083.4709


Output 2.7 (continued)

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 1 3.9377 0.1267 3.6894 4.1861 965.71 <.0001

gest 1 -0.7461 0.1837 -1.1061 -0.3862 16.51 <.0001

age 1 -0.2822 0.1703 -0.6160 0.0516 2.75 0.0975

gest*age 1 -0.2225 0.0963 -0.4113 -0.0336 5.33 0.0209

cigs 1 -1.7222 0.2473 -2.2070 -1.2374 48.48 <.0001

gest*cigs 1 -0.0326 0.1490 -0.3246 0.2593 0.05 0.8266

age*cigs 1 -0.3551 0.0997 -0.5505 -0.1596 12.68 0.0004

vita 1 1.8176 0.1349 1.5531 2.0820 181.47 <.0001

gest*vita 1 3.2873 0.1847 2.9253 3.6493 316.76 <.0001

age*vita 1 -0.4816 0.1802 -0.8347 -0.1285 7.15 0.0075

cigs*vita 1 -0.4040 0.2614 -0.9164 0.1085 2.39 0.1223

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

Each of these pairwise interaction effects (εga, ζgc, etc.) fits a 1 df interaction between a pair of factors. There are 11 parameters fitted in this model, each with 1 df, which leaves 16 − 11 = 5 df for the chi-squared statistic. Equation 2.12 is most easily fitted using the @2 notation in the MODEL statement. Each interaction effect in Equation 2.12 is not an odds ratio, but can be interpreted as an average of odds ratios (Agresti 1990, pp. 144–5 and Zelterman 1999, pp. 127–8). Statistical tests for individual interactions are described in detail in Section 2.5.

The fit of this model is very good: (χ2 = 1.86; 5 df; p = .87). Output 2.7 shows that all pairs of interactions are statistically significant except for those between high cigarette use and either gestational age or infant vitality. Both of these interactions can be removed from the log-linear model without compromising its goodness of fit. This is explained below.

There are four different methods for measuring the statistical significance of an interaction between two variables. These are discussed more fully in Section 2.5 and are illustrated there using this example. The Type 3 or partial association measure is described in Section 2.5.2 and is generally preferred. However, there are not large differences between this approach and the other methods for measuring the statistical significance obtained in Output 2.7.

A log-linear model with all possible three-way interactions is easily fitted using the @3 notation in the MODEL statement. This option produces Output 2.8.


Output 2.8 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 1 0.4232 0.4232

Scaled Deviance 1 0.4232 0.4232

Pearson Chi-Square 1 0.4035 0.4035

Scaled Pearson X2 1 0.4035 0.4035

Log Likelihood 44084.1802

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 1 3.9017 0.1412 3.6249 4.1785 763.44 <.0001

gest 1 -0.7024 0.2420 -1.1768 -0.2281 8.43 0.0037

age 1 -0.1757 0.2074 -0.5821 0.2308 0.72 0.3970

gest*age 1 -0.4219 0.3777 -1.1622 0.3183 1.25 0.2639

cigs 1 -1.6489 0.3392 -2.3137 -0.9842 23.64 <.0001

gest*cigs 1 0.1519 0.5186 -0.8645 1.1683 0.09 0.7696

age*cigs 1 -0.8283 0.5583 -1.9225 0.2658 2.20 0.1379

gest*age*cigs 1 0.1384 0.3489 -0.5454 0.8222 0.16 0.6915

vita 1 1.8525 0.1517 1.5551 2.1499 149.08 <.0001

gest*vita 1 3.2452 0.2484 2.7583 3.7320 170.68 <.0001

age*vita 1 -0.5916 0.2288 -1.0401 -0.1431 6.68 0.0097

gest*age*vita 1 0.2018 0.3888 -0.5603 0.9640 0.27 0.6037

cigs*vita 1 -0.4293 0.3719 -1.1583 0.2996 1.33 0.2484

gest*cigs*vita 1 -0.2404 0.5340 -1.2869 0.8062 0.20 0.6526

age*cigs*vita 1 0.3632 0.5952 -0.8034 1.5298 0.37 0.5417

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

Three-way interactions are interpreted as changes in the odds ratios between different strata. An example of a model for changing odds ratios for different 2 × 2 tables is given in Section 3.3. That example illustrates how log-linear models can be used to demonstrate a linear trend in a series of odds ratios.

Three-way and higher interactions are often difficult to interpret in practice. These higher order interactions sometimes appear with extreme statistical significance in sparse tables with small counts or are the result of a single unusually large or small count in one categorical cell. Section 9.4 describes a three-factor effect in terms of power and the sample size needed to detect this interaction.

None of the three-way interaction effects in this log-linear model are statistically significant in Output 2.8. This fact and the good fit of Equation 2.12 indicate that the final model should contain only a subset of all pairwise interactions and none of the three-way interactions.

The pairwise interaction ζgc between smoking habit and gestational age is not statistically significant in Output 2.7. Thus, this interaction can be omitted from Equation 2.12. The corresponding log-linear model after omitting ζgc is

log mgacv = µ + αg + βa + γc + δv + εga + ηgv + θac + κav + λcv. (2.13)

This model fits the data well (χ2 = 1.89; 6 df; p = .93), as seen in Output 2.9.

Output 2.9 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 6 1.8897 0.3149

Scaled Deviance 6 1.8897 0.3149

Pearson Chi-Square 6 1.8935 0.3156

Scaled Pearson X2 6 1.8935 0.3156

Log Likelihood 44083.4470

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 1 3.9392 0.1265 3.6913 4.1870 970.41 <.0001

gest 1 -0.7509 0.1824 -1.1084 -0.3934 16.95 <.0001

age 1 -0.2826 0.1703 -0.6164 0.0512 2.75 0.0971

cigs 1 -1.7323 0.2431 -2.2088 -1.2558 50.77 <.0001

vita 1 1.8193 0.1346 1.5554 2.0831 182.65 <.0001

gest*age 1 -0.2215 0.0962 -0.4101 -0.0329 5.30 0.0214

age*cigs 1 -0.3545 0.0997 -0.5498 -0.1591 12.65 0.0004

gest*vita 1 3.2886 0.1846 2.9268 3.6504 317.32 <.0001

age*vita 1 -0.4821 0.1801 -0.8352 -0.1291 7.16 0.0074

cigs*vita 1 -0.4241 0.2446 -0.9036 0.0554 3.01 0.0830

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.


Since Equation 2.13 has one fewer parameter than Equation 2.12, the chi-squared statistic has one more df. The value of chi-squared has a negligible increase, indicating that Equation 2.13 fits the data about as well as Equation 2.12.
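This comparison of nested models can be made explicit. Taking the deviances of Equations 2.12 and 2.13 from Outputs 2.7 and 2.9, their difference behaves as a 1 df chi-squared quantity; the sketch below (Python, illustrative only) computes it, with the 1 df upper-tail probability written via the complementary error function:

```python
import math

def chisq1_pvalue(x):
    """Upper-tail probability of a 1 df chi-squared statistic."""
    return math.erfc(math.sqrt(x / 2))

dev_212 = 1.8419   # Equation 2.12, all six pairwise interactions (5 df)
dev_213 = 1.8897   # Equation 2.13, with zeta_gc omitted (6 df)

# the deviance difference behaves as a chi-squared with 6 - 5 = 1 df
diff = dev_213 - dev_212
print(round(diff, 4), round(chisq1_pvalue(diff), 2))  # 0.0478 0.83
```

A difference of 0.0478 on 1 df is far from significant, which quantifies the "negligible increase" noted above.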

The parameter with the least statistical significance (that is, the largest p-value) in Output 2.9 corresponds to the main effect for maternal age (age): βa. This term could be removed from Equation 2.13, but there are two reasons that this action is generally avoided in practice.

First, there are several interactions with age that remain in the model. A model with an interaction that lacks one or more of its lower-order effects is said to be non-hierarchical. A log-linear model is hierarchical if every interaction effect is accompanied by all possible subset interactions. A pairwise interaction should also have the two separate main effects included in the same model. A hierarchical model including the three-way interaction for smoking, gestational age, and maternal age would also have to include each of the three pairwise interactions between these effects as well as the three main effects. Non-hierarchical log-linear models can be fitted in GENMOD but their practical interpretation can be difficult. An example of a non-hierarchical model is given by Worcester (1971) with discussion by Bishop et al. (1975, pp. 111–4) and Zelterman (1999, pp. 165–9).

Second, a log-linear model with all of its main effects is called comprehensive. If a main effect is removed from the log-linear model (regardless of its statistical significance level), then there will be some concern about whether that factor should be eliminated from the analysis altogether.

Returning to Output 2.9, the interaction λcv between smoking (cigs) and vitality (vita) is another good candidate to remove from the equation. Since the p-value for this parameter is .083, it is of modest statistical significance. A more detailed discussion of the significance level of the λcv parameter, in particular, is given in Section 2.5.

The resulting log-linear model after removing λcv

log mgacv = µ + αg + βa + γc + δv + εga + ηgv + θac + κav (2.14)

still has a good fit: (χ2 = 5.442; 7 df; p = .61). The GENMOD output for Equation 2.14 is given in Output 2.10. This model has four out of the six possible pairwise interactions, and each of these interactions has a statistical significance level of .0214 or less. This is the aim of good statistical modeling: a small number of parameters, with each playing an important role in explaining the data well. The interpretation of Equation 2.14 is that gestational age, maternal age, and mortality all exhibit pairwise interactions. Cigarette smoking is only related to maternal age and not to the other two variables. An interpretation of Equation 2.14 by Darroch, Lauritzen, and Speed (1980) is that cigarette smoking habit is conditionally independent of both mortality and gestational age given the maternal age. See Zelterman (1999, p. 128) for additional information. Such graphical models and interpretations are the subject of the book by Lauritzen (1996).


Output 2.10 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 7 4.6174 0.6596

Scaled Deviance 7 4.6174 0.6596

Pearson Chi-Square 7 5.4420 0.7774

Scaled Pearson X2 7 5.4420 0.7774

Log Likelihood 44082.0831

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 1 3.9916 0.1212 3.7541 4.2291 1085.41 <.0001

gest 1 -0.7509 0.1824 -1.1084 -0.3934 16.95 <.0001

age 1 -0.2973 0.1698 -0.6301 0.0355 3.07 0.0799

cigs 1 -2.1474 0.0466 -2.2387 -2.0560 2122.28 <.0001

vita 1 1.7659 0.1296 1.5120 2.0199 185.80 <.0001

gest*age 1 -0.2215 0.0962 -0.4101 -0.0329 5.30 0.0214

age*cigs 1 -0.3470 0.0995 -0.5421 -0.1520 12.16 0.0005

gest*vita 1 3.2886 0.1846 2.9268 3.6504 317.32 <.0001

age*vita 1 -0.4677 0.1798 -0.8202 -0.1152 6.76 0.0093

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

2.4 Residuals for Log-Linear Models

A wide variety of diagnostics for checking model assumptions are available for logistic regression and for linear models with normally distributed data. Scatter plots of variously defined residuals make up the bulk of these methods. The IPLOTS and INFLUENCE options in the LOGISTIC procedure, for example, produce a large number of useful plots that are specific to logistic models. There are also a number of statistical tools available for models of discrete data and log-linear models. GENMOD produces these with the OBSTATS option as described in this section.

There are six different residuals produced by GENMOD when the OBSTATS option is specified in the MODEL statement. A numerical example of these is given in Output 2.2. The most obvious of these residuals is the raw residual or Resraw. The Resraw residual is the difference between the observed and expected (predicted or fitted) counts. That is,

Resraw = count − Pred

where count denotes the observed frequencies in the data.

These residuals are familiar to everyone doing statistical modeling of data using linear regression, assuming normally distributed errors. Good statistical practice requires an examination of the residuals of fitted models. Unlike linear regression models that assume constant variances for all observations, raw Poisson residuals have variances that are equal to their means. So this simple difference is not always a useful statistic to detect lack of fit. In particular, the largest Resraw residual in absolute magnitude does not necessarily indicate the worst fitting observation in the data set. Large raw Poisson residuals are likely to correspond to the observations with the largest means. Conversely, small (absolute) raw residuals are usually associated with small Poisson means and are not necessarily indications of good fit either. In other words, a good definition of a “residual” for Poisson data will need to take into account the equality of the mean and the variance. The Reschi is the Pearson residual defined by

Reschi = (count − Pred) / (Pred)1/2.

The Reschi values are more comparable to each other than the raw residuals (Resraw) in locating lack of fit in the model because their variances are all nearly equal. Dividing the raw residuals by their estimated standard deviation (Pred1/2) yields residuals all with approximately unit variances.

The denominator of the Reschi residual takes into account the property that the Poisson variances are equal to their means. The Reschi values should all have an approximate mean of zero and a variance of one. The approximate variance is only correct when no parameters need to be estimated. A standardized Pearson residual (described below) takes into account the change in variability that is associated with the parameters being replaced by their estimates in the Pred values of the Reschi.

The Pearson chi-squared goodness of fit statistic

χ2 = Σ (count − Pred)2 / Pred = Σ Reschi2

is the sum of squared values of the Reschi. As with most residuals, the Reschi values are not mutually independent. The easiest way to show this dependence is to note that there are more terms in the sum of the chi-squared statistic than there are degrees of freedom.

The chi-squared statistic is introduced at Equation 2.4 in Section 2.2 and should be readily familiar to you as an overall test of the adequacy of the model to explain the data. A large (absolute) Reschi value corresponds to a data point that makes a large contribution to the chi-squared statistic and any lack of fit in the model. It is often useful to plot the Reschi values by the expected counts Pred or by some other important covariate value. Sections 4.3 and 4.4 contain examples of such plots of the Reschi residuals.
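To make these definitions concrete, the sketch below (Python, illustrative; the counts and fitted values are hypothetical and not taken from any model in this chapter) computes the Reschi values and verifies that their squares sum to the Pearson chi-squared statistic:

```python
import math

# hypothetical observed counts and fitted (Pred) values, for illustration only
counts = [50, 9, 41, 4]
preds  = [46.2, 12.1, 38.5, 7.2]

# Pearson residual: raw residual divided by its estimated standard deviation
reschi = [(c - p) / math.sqrt(p) for c, p in zip(counts, preds)]

# the Pearson chi-squared statistic is the sum of the squared Reschi values
chi2 = sum(r * r for r in reschi)

print(round(chi2, 4))  # 2.6913
```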

Another residual for Poisson data is based on the deviance statistic. The deviance statistic is introduced at Equation 2.5. It is defined as

G2 = 2 Σ count log(count / Pred)

and should be close in value to the Pearson chi-squared value when all of the expected counts are large.

The deviance residual or Resdev is

Resdev = ± √2 {count log(count / Pred) + Pred − count}1/2

where the ± is determined by whether the observed count is greater than or less than the fitted (Pred) value.

As with the other two types of Poisson residuals, the Resdev values are not mutually independent. The values of the deviance residuals are normalized and so are comparable to each other. A standardized version of the Resdev residuals with estimated parameters is given below. The Resdev residuals are useful in identifying poorly fitted observations in the data. They should also be close in value to the corresponding Reschi residuals, provided that the Pred values are not too small.

The sums of the fitted (Pred) and observed counts are both equal to the sample size. The Pred − count term in the definition of Resdev sums to zero, so the deviance is the sum of the squared Resdev residuals.
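This identity is easy to check numerically. In the Python sketch below (illustrative only), the counts and fitted values are hypothetical but chosen so that their totals agree, as they do for any fitted log-linear model with an intercept; the sum of the squared Resdev values then reproduces G2 exactly:

```python
import math

# hypothetical counts and fitted values (not from any model in this chapter),
# chosen so their totals agree: both sum to 60
counts = [10.0, 20.0, 30.0]
preds  = [12.0, 18.0, 30.0]

def resdev(c, p):
    """Deviance residual, signed by the raw residual count - Pred."""
    return math.copysign(math.sqrt(2 * (c * math.log(c / p) + p - c)), c - p)

# the deviance statistic G2 = 2 * sum of count * log(count / Pred)
g2 = 2 * sum(c * math.log(c / p) for c, p in zip(counts, preds))

# because the Pred - count terms sum to zero, G2 equals sum of squared Resdev
sum_sq = sum(resdev(c, p) ** 2 for c, p in zip(counts, preds))

print(round(g2, 4), round(sum_sq, 4))  # 0.568 0.568
```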

Another group of residuals is also available in GENMOD by specifying the OBSTATS option in the MODEL statement. These residuals take into account the decreased variability in the previously described residuals that is associated with substituting parameters by their estimated values. Their values should not be very different, in general, and can often be used interchangeably.

The standardized Pearson residual is defined by

StReschi = (count − Pred) / {(1 − h) Pred}1/2

where h is a function of the Hessian matrix reflecting the variability of the estimated parameters in the fitted (Pred) counts. These residuals are called StReschi in the OBSTATS output and Table 2.2, for example. The variance of StReschi should be closer to unity than that of the Reschi residual when parameters are estimated.

The standardized deviance residual is defined as

StResdev = Resdev / (1 − h)1/2.

These residuals are also obtained through the OBSTATS option in GENMOD. Finally, the standardized likelihood residual given by

Reslik = ± {(1 − h) StResdev2 + h StReschi2}1/2

is a hybrid between the standardized Pearson and deviance residuals. The ± sign is determined by whether the observed count is greater than or less than the fitted Pred value. A large absolute value of Reslik is an indication of a data point that is highly influential in estimating parameters of the model.

The use of these residuals to identify outliers is illustrated in Chapters 4 and 5.

2.5 Tests of Statistical Significance

There are different methods used to measure the strength of association between two categorical variables. Four different approaches are described here, and an example is summarized in Table 2.5. These approaches to significance testing are in contrast to the actual test statistics such as chi-squared and deviance.

Tests of statistical significance are an important topic in model building and inference. To continue the example given in Table 2.3 of Section 2.3, this section describes several statistical tests of the relationship between a mother’s cigarette habit and infant vitality. Suppose the interaction between smoking and vitality λcv is most important to you and you want to examine the statistical significance between these two variables more carefully. This section compares four different methods reported by the FREQ and GENMOD procedures to measure the statistical significance of this interaction.


In particular, consider Equation 2.13 for the data in Table 2.3

log mgacv = µ + αg + βa + γc + δv + εga + ηgv + θac + κav + λcv.

That is, all pairwise interactions are present except the interaction between gestational age (g) and mother’s smoking habit (c). This model has a good fit to the data: (χ2 = 1.89; 6 df; p = .93). Output 2.9 for this model is given in Section 2.3.

2.5.1 Marginal and Wald Significance Levels

The first approach you might take is to examine the marginal significance between smoking and vitality. This straightforward method examines the marginal association between the two variables. Table 2.4 presents the marginal frequencies of smoking and infant vitality.

TABLE 2.4 The marginal association between infant vitality and mother’s smoking habit is measured with the data in this table.

                        Vitality

Smoking habit      Dead      Alive      Total

Low                 129       5968       6097
High                 20        634        654

Total               149       6602       6751

The marginal data summary of Table 2.4 is obtained using PROC FREQ:

proc freq;
   tables cigs * vita / chisq;
   weight count;
run;

This procedure examines the data of Table 2.4 and produces a large number of statistical measures of association. Table 2.4 is called a marginal table because it sums over all variables not involved in the interaction of interest. The marginal significance between smoking and infant vitality is only a function of the data in this table. The chi-squared statistic for this table is 2.43 and has a p-value of .1190. The deviance statistic is G2 = 2.20; p = .1379. The choice of test statistic is not the concept being emphasized here. The marginal significance refers to the approach of examining Table 2.4, not the actual test statistic employed.
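The marginal table and its chi-squared statistic can be checked directly against the 16 counts of Table 2.3. The Python sketch below (illustrative only; the book's own code is SAS) collapses over gestational and maternal age and recomputes the Pearson chi-squared for the marginal 2 × 2 table:

```python
# the 16 records of Table 2.3 as (gest, age, cigs, vita, count)
data = [
    (0, 0, 0, 0,  50), (0, 0, 0, 1,  315), (0, 0, 1, 0, 9), (0, 0, 1, 1,  40),
    (0, 1, 0, 0,  41), (0, 1, 0, 1,  147), (0, 1, 1, 0, 4), (0, 1, 1, 1,  11),
    (1, 0, 0, 0,  24), (1, 0, 0, 1, 4012), (1, 0, 1, 0, 6), (1, 0, 1, 1, 459),
    (1, 1, 0, 0,  14), (1, 1, 0, 1, 1494), (1, 1, 1, 0, 1), (1, 1, 1, 1, 124),
]

# collapse over gestational and maternal age: marginal cigs x vita table
table = [[0, 0], [0, 0]]
for gest, age, cigs, vita, count in data:
    table[cigs][vita] += count

n = sum(map(sum, table))
row = [sum(r) for r in table]
col = [sum(t) for t in zip(*table)]

# Pearson chi-squared for the 2 x 2 marginal table
chi2 = sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))

print(table)           # [[129, 5968], [20, 634]]
print(round(chi2, 2))  # 2.43
```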

The marginal approach to significance testing is very easy to explain and interpret. Marginal significance in this example examines the 2 × 2 table in a simple manner by summing over all the variables that are not involved in the relationship of interest. Marginal significance might produce misleading conclusions, however, if there are strong relationships between the summed variables and those present in the table. In the present data, there is a strong relationship between a smoking habit and the mother’s age, for example. A problem with summing over such a strong effect is that doing so might introduce, remove, or even reverse the direction of other effects. This is often referred to as Simpson’s paradox. An extreme example of Simpson’s paradox is given in Zelterman (1999, p. 127). While a marginal analysis is the easiest approach in explaining an interaction, you also need to take into account the other effects present in the data and their interactions with the variables of primary interest.

Other approaches to the measurement of the statistical significance of the relationship between smoking and perinatal mortality are performed by GENMOD. These are obtained by fitting and comparing various log-linear models to the data. The risk of Simpson’s paradox is reduced by including all of the important interactions between variables in the model.

The following program produces three different statistical significance levels for the relationship between smoking and infant vitality in Table 2.3. These lines go at the end of Program 2.2 and produce Output 2.11.

Program 2.3 proc genmod;
   model count = gest age cigs vita gest*age
         age*cigs cigs*vita gest*vita age*vita / dist = Poisson
         type1 type3;
run;

Output 2.11 The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 1 3.9392 0.1265 3.6913 4.1870 970.41 <.0001

gest 1 -0.7509 0.1824 -1.1084 -0.3934 16.95 <.0001

age 1 -0.2826 0.1703 -0.6164 0.0512 2.75 0.0971

cigs 1 -1.7323 0.2431 -2.2088 -1.2558 50.77 <.0001

vita 1 1.8193 0.1346 1.5554 2.0831 182.65 <.0001

gest*age 1 -0.2215 0.0962 -0.4101 -0.0329 5.30 0.0214

age*cigs 1 -0.3545 0.0997 -0.5498 -0.1591 12.65 0.0004

cigs*vita 1 -0.4241 0.2446 -0.9036 0.0554 3.01 0.0830

gest*vita 1 3.2886 0.1846 2.9268 3.6504 317.32 <.0001

age*vita 1 -0.4821 0.1801 -0.8352 -0.1291 7.16 0.0074

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

The GENMOD Procedure

LR Statistics For Type 1 Analysis

Source Deviance DF Chi-Square Pr > ChiSq

Intercept 20053.1218

gest 14822.4974 1 5230.62 <.0001

age 13364.9714 1 1457.53 <.0001

cigs 8301.9060 1 5063.07 <.0001

vita 374.1426 1 7927.76 <.0001

gest*age 363.3681 1 10.77 0.0010

age*cigs 350.5508 1 12.82 0.0003

cigs*vita 348.3500 1 2.20 0.1379

gest*vita 8.8418 1 339.51 <.0001

age*vita 1.8897 1 6.95 0.0084


Output 2.11 (continued)

The GENMOD Procedure

LR Statistics For Type 3 Analysis

Source DF Chi-Square Pr > ChiSq

gest 1 18.22 <.0001

age 1 2.79 0.0948

cigs 1 71.82 <.0001

vita 1 249.61 <.0001

gest*age 1 5.18 0.0228

age*cigs 1 13.34 0.0003

cigs*vita 1 2.73 0.0986

gest*vita 1 333.77 <.0001

age*vita 1 6.95 0.0084

Output 2.11 gives three sets of significance levels for all of the interactions in Equation 2.13. The most familiar of these is based on the Wald statistic, which is given along with the parameter estimates at the top of the output. The squared parameter estimate is divided by its estimated variance to produce an approximate 1 df chi-squared test of the hypothesis that the population parameter is zero. Estimated standard errors of parameter estimates in the Wald statistic are obtained from the curvature or second derivative of the log-likelihood function. The log-likelihood and this curvature are described at greater length in Section 2.6.
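As an illustration, the Wald chi-squared for the cigs*vita interaction can be reproduced from the estimate and standard error reported in Output 2.11. The sketch below is Python, not SAS, and writes the 1 df chi-squared upper tail via the complementary error function:

```python
import math

def chisq1_pvalue(x):
    """Upper-tail probability of a 1 df chi-squared statistic."""
    return math.erfc(math.sqrt(x / 2))

# estimate and standard error of the cigs*vita interaction in Output 2.11
estimate, stderr = -0.4241, 0.2446

# Wald statistic: squared estimate divided by its estimated variance
wald = (estimate / stderr) ** 2

print(round(wald, 2), round(chisq1_pvalue(wald), 3))  # 3.01 0.083
```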

2.5.2 Type 1 and Type 3 Significance Levels

The Type 1 and Type 3 methods provide the other two sets of significance levels in Output 2.11. The Type 1 method compares the value of the deviance statistic without and then again with the term of interest after fitting all of the model terms listed before this one in the MODEL statement. Specifically, the Type 1 analysis of the smoking and vitality interaction (λcv) compares the improved fit of the model

log mgacv = µ + αg + βa + γc + δv + εga + θac + λcv, (2.15)

including the λcv term, over the model

log mgacv = µ + αg + βa + γc + δv + εga + θac, (2.16)

without the λcv term. The choice of terms that are included in these two models depends upon the order in which they are listed in the GENMOD MODEL statement.

Two models are said to be nested if one contains a subset of the terms in the other. The difference of the values of the deviance statistics on two nested models behaves approximately as a chi-squared statistic with degrees of freedom equal to the difference of the df for the two models. This property of the deviance statistics is also described in Section 2.2 following its introduction at Equation 2.5. In the present example, Equations 2.15 and 2.16 differ only by the λcv term, which represents 1 df.

The Type 1 Analysis section of Output 2.11 shows that the deviance for Equation 2.16 is 350.55 and is 348.35 for Equation 2.15. The difference of 2.20 is the value of the Type 1 significance test for this parameter in the model.
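This Type 1 comparison can be reproduced from the deviances in Output 2.11; the model fitted up through age*cigs lacks λcv and the next line adds it. A Python sketch (illustrative only):

```python
import math

def chisq1_pvalue(x):
    """Upper-tail probability of a 1 df chi-squared statistic."""
    return math.erfc(math.sqrt(x / 2))

# Type 1 deviances from Output 2.11: the model fitted up through age*cigs
# (lacking lambda_cv) and the model after adding cigs*vita
dev_without = 350.5508
dev_with    = 348.3500

diff = dev_without - dev_with
print(f"{diff:.2f}")                 # 2.20
print(f"{chisq1_pvalue(diff):.4f}")  # 0.1379
```

The resulting p-value matches the .1379 reported for cigs*vita in the Type 1 analysis.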


44 Advanced Log-Linear Models Using SAS

The Type 1 approach to significance testing depends upon the order in which the interactions are entered into the model or listed in the MODEL statement in the GENMOD program. The Type 1 analysis might therefore be affected by this somewhat arbitrary order. The Wald test does not depend on this order. The Type 1 comparison of Equations 2.15 and 2.16 also does not take into account the highly significant parameters ηgv and κav present in Equation 2.13 but not yet entered into Equation 2.15 at the point that λcv is being tested.

The listed order of interactions seems a rather arbitrary criterion for determining statistical significance, so another approach, called the Type 3 or partial association method, is offered by GENMOD. Partial association is the significance level of a parameter when that parameter is added last, after all other parameters have already been fitted in the model. The Type 3 statistics of Output 2.11 are the result of GENMOD fitting several log-linear models, each with, and then again without, one of the model terms.

In the case of λcv the Type 3 analysis compares the deviance of the log-linear model

log mgacv = µ + αg + βa + γc + δv + εga + ηgv + θac + κav,

without the λcv interaction, to the deviance of Equation 2.13

log mgacv = µ + αg + βa + γc + δv + εga + ηgv + θac + κav + λcv,

containing this λcv interaction.

For models with many terms, the Type 3 approach is the preferred method. In the recent past this method was computationally slow, but this is no longer an important issue. The Type 3 approach to significance testing avoids the risk of Simpson’s paradox because all other important interactions are included in the model. The Type 3 analysis also does not rely on the arbitrary order in which log-linear model terms are listed in the GENMOD MODEL statement.

All of the different approaches to measuring the statistical significance of the relationship between infant vitality and the mother’s smoking habit are summarized in Table 2.5. Despite the wide range of interpretations and approaches used to obtain these test statistics, notice that there is not a wide range of p-values associated with these tests. The p-values range between .08 and .14. Different assumptions are made to obtain each of these values. In the event that there is a great difference in these approaches, the Type 3 significance levels should be relied upon more heavily than the others. The Type 3 partial association avoids the pitfalls of Simpson’s paradox and does not depend on the order in which terms are listed in the MODEL statement. In this example, the Type 3 approach provides a p-value that is neither the largest nor the smallest of the four methods compared.

TABLE 2.5 Different approaches to measuring the statistical significance of the association between infant vitality and the mother’s smoking habit.

Approach       1 df chi-squared statistic    p-value
Marginal χ2              2.43                  .12
G2                       2.20                  .14
Wald test                3.01                  .08
Type 1                   2.20                  .14
Type 3                   2.73                  .10


Chapter 2 Basic Log-Linear Models and the GENMOD Procedure 45

2.6 The Likelihood Function

This section provides some additional theory that might aid you in interpreting the GENMOD output. It is not a substitute for a full course in mathematical statistics but should provide just enough non-calculus introduction for an informed GENMOD user to appreciate the output. This section should remove some of the mystery associated with estimation of parameters and construction of confidence intervals. The methods described in this section are not particular to log-linear models. Their introduction here is intended to clarify how GENMOD estimates parameters and builds confidence intervals, in general.

Let yi (i = 1, . . . , k) denote independent Poisson observations with respective means mi. Every mi must be strictly positive; otherwise the corresponding yi will have a degenerate distribution. The means mi are not free to range arbitrarily but are constrained to follow some log-linear model. In most cases a model of the means is specified so that it can be interpreted or used to draw inference on some important aspect of the original data. Examples of useful log-linear models are given in Sections 2.2 and 2.3.

The Poisson log-likelihood is very important and central to most of the statistical inference that is drawn from data fitted to log-linear models in GENMOD. Begin by writing the logarithm of the joint probability (likelihood) of the observed data

log ∏i {e^(−mi) mi^(yi) / yi!} = ∑i yi log mi − ∑i mi − ∑i log yi!

Recognize the term in ‘{}’ braces as the Poisson probability of yi with mean mi. The yi are independent, so the product of the probabilities is taken here.

The term involving the yi! is not a function of any parameters that need to be estimated or tested, so this term will be ignored in everything that follows. The remaining function

l(y, m) = ∑ yi log mi − ∑ mi (2.17)

is called the Poisson log-likelihood and plays a major role in the way GENMOD fits log-linear models, tests parameters, and builds confidence intervals. It is also the basis of other useful computing results from Poisson distributed data yi.

The Poisson Log Likelihood, given by the GENMOD procedure in Table 2.1 for example, is calculated as

l(y, m̂) = ∑ yi log m̂i − N, (2.18)

where m̂ = {m̂i} are the values of the mi that maximize l(y, m), subject to the form of the log-linear model. The m̂i are the fitted Poisson means, or Pred values, obtained by GENMOD and printed as part of the OBSTATS output. In almost every setting, the sum of the estimated means m̂i is equal to the sample size N. That is,

∑ m̂i = N

for most models you are likely to encounter.

The GENMOD procedure finds those estimates m̂ of m that maximize the log-likelihood l subject to the conditions of the log-linear model. Such estimates m̂ are referred to as maximum likelihood estimates. These estimated values are not themselves the most likely values, as their name seems to imply; rather, they are the parameter values that maximize the probability of the observed data yi. In other words, maximum likelihood finds parameter values under which the observed data are most probable.
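As a sketch of Equation 2.18 outside SAS, the Python fragment below (with hypothetical toy counts of my own) evaluates the Poisson log-likelihood for an intercept-only model, where every fitted mean equals the sample mean and the fitted means sum to N = ∑yi:

```python
import math

def poisson_loglik(y, m_hat):
    """Poisson log-likelihood of Equation 2.18: sum(y_i log m_hat_i) - N,
    with N = sum(y_i); the log y_i! term is dropped, as in the text.
    (Valid here because the fitted means sum to N.)"""
    assert all(m > 0 for m in m_hat)  # every mean must be strictly positive
    return sum(yi * math.log(mi) for yi, mi in zip(y, m_hat)) - sum(y)

# Hypothetical toy counts. Under an intercept-only model every fitted
# mean is the sample mean ybar, and the fitted means then sum to N.
y = [3, 5, 4]
ybar = sum(y) / len(y)            # 4.0
m_hat = [ybar] * len(y)

print(sum(m_hat))                           # 12.0 = N, as the text notes
print(round(poisson_loglik(y, m_hat), 4))   # 4.6355
```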

There are, of course, other ways of estimating parameters from data besides the method of maximum likelihood. The WLS option in the CATMOD procedure, for example, can be used to find weighted least squares estimates of the mi that minimize the value of the Pearson chi-squared goodness of fit statistic. This minimum chi-squared method is entirely reasonable if you are most interested in judging the fit of a model and want to provide the best possible fit. There are mathematical results (Zelterman 1999, pp. 134–5) proving the equivalence of maximum likelihood estimates and minimum chi-squared estimates when large sample sizes are involved. Other methods of parameter estimation include those that


are unbiased or that minimize the expected squared error. In most settings there should be little difference between these estimates and maximum likelihood estimates. Maximum likelihood estimation with normally distributed data is also known as least squares estimation. Maximum Poisson likelihood estimation is preferred to least squares or weighted least squares for Poisson distributed data.

If you simplify the problem greatly, it is easier to proceed with this discussion. Suppose the means m depend on only one scalar-valued parameter that will be denoted by θ. That is, if the value of θ were known, then the values of all means m1(θ), . . . , mk(θ) would be known as well. (A numerical example of this setting with a single parameter θ is given at Equation 2.21 below.) The problem, then, is to estimate and draw inference on the value of the scalar θ. Most problems involve several parameters to be estimated, though their number is typically much smaller than the number of means. For example, in Equation 2.6 the parameters (µ, αi, βj) all need to be estimated. This more general problem with the estimation of several parameters is more difficult to illustrate but follows the same general principles that hold for the estimation of a single parameter. Reducing the problem to that of estimating a single parameter θ allows for the simplified illustration in Figure 2.1.

Figure 2.1 The log-likelihood function l(θ) given at Equation 2.19. Indicated are the efficient score test, the deviance (likelihood ratio) statistic, and the Wald statistic for testing the null hypothesis that the true (population) value of the parameter θ is equal to θ0. The maximum likelihood estimate of θ is denoted by θ̂. In most situations you take θ0 to be zero.

[Figure 2.1 shows the log-likelihood of θ as a downward-facing curve with its maximum at θ̂. The Wald statistic corresponds to the horizontal distance between θ̂ and θ0, half of the deviance to the vertical drop in l(θ) between θ̂ and θ0, and the score test to the slope of the curve at θ0.]


For most problems the likelihood function

l(θ) = l{y, m(θ)} = ∑ yi log mi(θ) − N (2.19)

obtained at Equation 2.18 will take the shape of a downward-facing parabola such as the one pictured in Figure 2.1. If θ is a vector-valued parameter then this figure takes an analogous multivariate shape. The value θ̂ of θ that maximizes l(θ), subject to the form of the model, is the maximum likelihood estimate of θ.

2.6.1 The Variance of the Parameter Estimate

The most prominent feature of likelihood-based estimation is the ease with which the standard error of the estimator θ̂ can be approximated from the likelihood function. Roughly speaking, sharper peaks of the likelihood function are associated with smaller estimated standard errors of the parameter being estimated. Similarly, if the top of the likelihood in Figure 2.1 is broad and flat, then the parameter being estimated is known with less certainty and will have a larger standard error. The sharpness of the peak and the standard error are expressed through the second derivative of the log-likelihood function, described at Equation 2.20 below.

The first derivative of the likelihood function l(θ) is called the score function, and it measures the slope of the likelihood function at any value of the parameter θ. The GENMOD procedure solves for the value of the parameter that maximizes the likelihood. This is done by solving for the parameter value that sets the score function equal to zero. The slope of the likelihood will always be zero at its top, so the score or first derivative of l(θ) is chiefly useful for identifying the maximum likelihood estimate. The SAS procedure NLIN, for example, prints out values of the score function as a means of demonstrating how close the estimation procedure has come to the top of the likelihood. The score function can also be used to test hypotheses about θ, as described below.

The curvature of the log-likelihood function is measured through its negative second derivative, also called the observed information of θ. The negative sign is needed because the log-likelihood curves downward, so the second derivative at the maximum is a negative number. The observed information is then a positive number. Symbolically,

Observed Information = −(∂/∂θ)² l(θ),

evaluated at θ̂.

The larger the information, the sharper the peak of the likelihood and the smaller the estimated standard error of the parameter. In simple terms, the relationship used by GENMOD to estimate the variance of θ̂ is

Var(θ̂) ≈ 1/Observed Information. (2.20)

That is, the sharper the curvature at the top of the log-likelihood, the smaller the estimated variance of the parameter being estimated.

The functional form of the Poisson likelihood function is given at Equation 2.18 and its typical shape is plotted in Figure 2.1. The discussion up to Equation 2.20 has greatly simplified the process used by GENMOD in approximating standard errors of parameter estimates. Several parameters must be estimated in most practical settings, but the general principle of examining the second derivative of the likelihood still holds. The first derivative or score function is equated to zero and solved to find the parameter estimates, such as those that are given in the GENMOD output. The second derivative (in the multiple parameter setting) is a matrix that is inverted to provide estimates of the standard errors that are also given in this output. This discussion should provide you with a better understanding of the process that GENMOD uses to estimate parameters and their standard errors. The remainder of the theory in this section describes the process of testing hypotheses about the parameters, but first we want to reinforce the points made so far with a numerical example.
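As a numerical check of Equation 2.20 outside SAS, the sketch below (Python, hypothetical toy counts of my own rather than any data set from the book) computes the observed information for a single-common-mean Poisson model both analytically and by a finite-difference second derivative, then converts it to a standard error:

```python
import math

y = [3, 5, 4]                 # hypothetical toy counts
k, S = len(y), sum(y)         # k cells, total count S = 12

def loglik(theta):
    # l(theta) = S log(theta) - k*theta  (Equation 2.17 with every m_i = theta)
    return S * math.log(theta) - k * theta

theta_hat = S / k             # the MLE is the sample mean, 4.0

# Observed information: negative second derivative of l at theta_hat.
info_analytic = S / theta_hat**2            # 12/16 = 0.75

h = 1e-4                                     # finite-difference step
info_fd = -(loglik(theta_hat + h) - 2 * loglik(theta_hat)
            + loglik(theta_hat - h)) / h**2

var_hat = 1.0 / info_analytic                # Equation 2.20
se_hat = math.sqrt(var_hat)

print(round(info_fd, 3))   # about 0.75, agreeing with the analytic value
print(round(se_hat, 4))    # 1.1547
```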

2.6.2 A Numerical Example

Table 7.4 describes the number of lottery winners in several towns in the New Haven area. The number of lottery winners yi should have a Poisson distribution because of the binomial approximation: many tickets are sold, and each has a very small probability of winning. It stands to reason that more winners would be expected in towns with larger populations. Consider a simple model in which the mean number of winners mi is proportional to the population (Popi) of the ith town. That is, the number of winners yi in the ith town has a Poisson distribution with mean mi that satisfies the model

mi = θ × Popi . (2.21)

for a scalar-valued parameter θ to be estimated.

More detailed models and analyses of this data are presented in Section 7.4. The aim of this example is to describe the statistical inference on the single parameter θ in Equation 2.21.

The Poisson log likelihood function (Equation 2.17) for Equation 2.21 is

l(θ) = ∑ yi log mi − ∑ mi
     = ∑ yi log(θ Popi) − ∑ θ Popi
     = log(θ) ∑ yi + ∑ yi log(Popi) − θ ∑ Popi

Numerically, the values

∑ yi = 135, ∑ Popi = 477.6, and ∑ yi log(Popi) = 434.832

are calculated from the data in Table 7.4. The log likelihood for this problem can then be written as

l(θ) = 434.832 + 135 log(θ) − 477.6 θ. (2.22)

This function l(θ) is plotted in Figure 2.2.

The model given by Equation 2.21 can be fitted in GENMOD using the code

proc genmod;
   model winners = pop / noint link=identity dist=Poisson;

run;

Use the identity link to model the mean, instead of the log mean, for this Poisson data. There is no intercept in the model mi = θ Popi, so the NOINT option is needed.

The output from this program is given in Output 2.12. Issues such as goodness of fit or the appropriateness of this model are described in Section 7.4. This section verifies some of the calculated values in Output 2.12.
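Several of the values in Output 2.12 can be reproduced outside SAS from the three summary statistics above. A minimal sketch in Python (the variable names are my own; GENMOD's internal fitting algorithm is not shown):

```python
import math

S_y = 135.0          # sum of y_i
S_pop = 477.6        # sum of Pop_i
S_ylogpop = 434.832  # sum of y_i log(Pop_i), as rounded in the text

theta_hat = S_y / S_pop                      # maximum likelihood estimate
loglik = S_ylogpop + S_y * math.log(theta_hat) - S_pop * theta_hat

info = S_y / theta_hat**2                    # observed information
se = math.sqrt(1.0 / info)                   # Equation 2.20

wald_lo = theta_hat - 1.96 * se              # Wald 95% confidence limits
wald_hi = theta_hat + 1.96 * se

print(round(theta_hat, 4))                   # 0.2827
print(round(loglik, 2))                      # 129.26
print(round(se, 4))                          # 0.0243
print(round(wald_lo, 4), round(wald_hi, 4))  # about .2350 and .3303
```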


Figure 2.2 A plot of the log-likelihood l(θ) given at Equation 2.22 for the model of Equation 2.21

[The plot shows l(θ) for θ between .2 and .4, with l(θ) on the vertical axis ranging from about 110 to 130 and reaching its peak at θ̂ = .2827.]

The maximum of the log likelihood function (Equation 2.22) occurs at

θ̂ = 135/477.6 = .2827

and this value is also given as the estimated coefficient in Output 2.12.

Output 2.12  The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion            DF       Value    Value/DF
Deviance             20     30.1754      1.5088
Scaled Deviance      20     30.1754      1.5088
Pearson Chi-Square   20     32.6813      1.6341
Scaled Pearson X2    20     32.6813      1.6341
Log Likelihood             129.2597

Analysis Of Parameter Estimates

                             Standard       Wald 95%
Parameter   DF   Estimate       Error   Confidence Limits   Chi-Square   Pr > ChiSq
Intercept    0     0.0000      0.0000    0.0000    0.0000            .            .
popul        1     0.2827      0.0243    0.2350    0.3303       135.00       <.0001
Scale        0     1.0000      0.0000    1.0000    1.0000

NOTE: The scale parameter was held fixed.

Page 62: Discrete Distributions - SAS

50 Advanced Log-Linear Models Using SAS

Output 2.12 (continued)  The GENMOD Procedure

Analysis Of Parameter Estimates

                             Standard   Likelihood Ratio 95%
Parameter   DF   Estimate       Error     Confidence Limits   Chi-Square   Pr > ChiSq
Intercept    0     0.0000      0.0000     0.0000    0.0000            .            .
popul        1     0.2827      0.0243     0.2376    0.3331       135.00       <.0001
Scale        0     1.0000      0.0000     1.0000    1.0000

NOTE: The scale parameter was held fixed.

The numerical value of the log likelihood (Equation 2.22) is l(·) evaluated at θ̂. That is,

l(.2827) = 434.832 + 135 log(.2827) − 477.6 × (.2827) = 129.2597.

This value is given in Output 2.12 as the Log Likelihood.

Continuing with this example, the second derivative of the log-likelihood given at Equation 2.22 with respect to θ is

(∂/∂θ)² l(θ) = −135/θ².

At its maximum, the second derivative of the log likelihood is

(∂/∂θ)² l(θ) |θ=θ̂ = −135/θ̂² = −135/(.2827)² = −1689.64.

Similarly, the observed information is 1689.64. You then use Equation 2.20 to estimate the variance of θ̂ as

1/1689.64 = 5.918 × 10⁻⁴.

Finally, the estimated standard error of θ̂ is

√(5.918 × 10⁻⁴) = .0243,

which is also given as the standard error of θ in Output 2.12.

The Wald confidence interval for θ in Output 2.12 is derived as

.2827 ± 1.96 × .0243 = (.2350, .3303)

This interval and the Wald chi-squared test are described at Equation 2.9 in Section 2.2. The likelihood ratio confidence interval for θ is explained below.

In practice, fitted models will generally contain several parameters, each of which has an estimated value and an approximated standard error. The same principles illustrated in this example continue to hold.


2.6.3 Hypothesis Tests and Confidence Intervals

There are three popular methods for testing hypotheses about a parameter θ based on the log-likelihood function illustrated in Figure 2.1. The most commonly seen hypothesis is that the unknown population parameter is equal to some fixed value denoted θ0. Most often θ0 is taken to be zero. In regression settings, for example, testing that a regression coefficient is zero asks if the explanatory variable has any effect on the dependent variable. Output 2.7, for example, contains tests of whether there is a non-zero interaction effect between the maternal and gestational ages.

The most familiar of the three hypothesis tests of θ = θ0 is called the Wald test. This test of the null hypothesis θ0 = 0 is printed by GENMOD, as given in Output 2.7 for example. The squared difference between θ̂ and θ0 is divided by the estimated variance of θ̂ to produce a test statistic with an approximate 1 df chi-squared distribution. This statistic is also given at Equation 2.8 in Section 2.2.1. At the bottom of Figure 2.1 this test corresponds to the difference between θ̂ and θ0 on the horizontal scale. The chi-squared distribution of the Wald statistic holds under the null hypothesis when the correct model is fitted. The behavior of chi-squared statistics under the alternative hypothesis is described in Section 9.3.

The (vertical) difference between the log-likelihood l(·) evaluated at θ0 and at its maximum θ̂ is a second method of testing the hypothesis. If the true unknown population value of the parameter is θ0 (or zero, for example), then the maximum likelihood estimate θ̂ of θ should be close to this value. The value of the likelihood will be greatest at its maximum, of course, and its value at the true parameter value θ0 will be slightly less. Refer to the top of Figure 2.1 to make this distinction clear.

The difference between the value of the likelihood at its maximum l(θ̂) and at the population value l(θ0) should not be too large. Intuitively, l(θ̂) should not be much larger than l(θ0) if θ̂ and θ0 are close together. Similarly, if the true population value of θ is θ0, then θ̂ and θ0 should not be very different in value.

the true population value of θ is θ0, twice the difference of l(θ) and l(θ0) behaves as achi-squared random variable with 1 df. The statistic based on this difference is often calledthe likelihood ratio or the deviance. The likelihood ratio statistic for Poisson data is thedeviance statistic given at Equation 2.5. That is,

G2 = 2{l(θ) − l(θ0)}.A third method for testing the hypothesis of θ = θ0 is attributed to C.R. Rao who

suggests that you look at the score function or slope of the log-likelihood at θ0. If θ0 isthe true population value of θ , then the score function at θ0 should be close to zero. Oneadvantage of this method is that there is no need for the actual value of θ that maximizes thelog-likelihood in order to evaluate the score test. This method is illustrated in Figure 2.1.
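As an illustration of the score test outside SAS, the sketch below (Python, hypothetical toy counts of my own) tests H0: θ = θ0 for the single-common-mean Poisson model with l(θ) = ∑yi log θ − kθ, using the score U(θ0) and the expected information at θ0:

```python
import math

y = [3, 5, 4]                   # hypothetical toy counts
k, S = len(y), sum(y)

theta0 = 3.0                    # hypothesized value of theta

# Score: first derivative of l(theta) = S log(theta) - k*theta, at theta0.
score = S / theta0 - k                       # 12/3 - 3 = 1.0

# Expected (Fisher) information at theta0: E(sum y_i)/theta0^2 = k/theta0.
info0 = k / theta0                           # 1.0

# Score statistic U^2 / I, approximately chi-squared with 1 df under H0;
# note theta_hat is never needed.
score_stat = score**2 / info0
p_value = math.erfc(math.sqrt(score_stat / 2.0))

print(score_stat)               # 1.0
print(round(p_value, 3))        # 0.317
```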

Confidence intervals for θ can be found by the same methods that test the hypothesis that θ is equal to θ0. In the case of the Wald test, this is easy to do. A 95% confidence interval for θ is given by θ̂ plus and minus 1.96 times the estimated standard error of θ̂. This method is described at Equation 2.9 in Section 2.1.

The likelihood ratio-based confidence interval for θ is, computationally, a much more difficult problem to solve. This procedure is specified by the LRCI option. The likelihood ratio 1 − α confidence interval is obtained through an iterative procedure that finds values of θ for which

2{l(θ̂) − l(θ)}

is equal to the upper α point of a chi-squared variate with df equal to the dimension of θ.

This equation is easy to solve when θ is a scalar, but can be computationally intensive when θ is a vector of parameter values. The GENMOD manual warns that requesting the LRCI option can result in a long wait for the answer, especially when a large number of parameters are involved. With large samples there should be only a small difference between the (easily obtained) Wald confidence interval and that from the LRCI option.


The likelihood ratio confidence interval for θ in the lottery example, with log likelihood given at Equation 2.22, can be illustrated. The GENMOD code to obtain likelihood ratio confidence intervals for the lottery example is

proc genmod;
   model winners = pop / noint link=identity dist=Poisson lrci;

run;

The likelihood ratio statistic for this problem is

2{l(θ̂) − l(θ)} = 2{129.2597 − (434.832 + 135 log θ − 477.6 θ)}
               = 955.2 θ − 270 log θ − 611.1446

and is compared to the upper 95th percentile of a 1 df chi-squared distribution, 3.84. Specifically, the equation

2{l(θ̂) − l(θ)} = 3.84

has two solutions in θ, namely .2376 and .3331. These two values are given as the likelihood ratio confidence interval in Output 2.12. They are not very different from the Wald interval of (.2350, .3303), also given in Output 2.12.
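The two solutions of this equation can be located numerically, since the left side is near zero at θ̂ and increases in both directions. A sketch in Python using simple bisection (this illustrates the idea only, not GENMOD's actual iterative LRCI procedure; small fourth-decimal discrepancies come from the rounded constants in Equation 2.22):

```python
import math

def lr_stat(theta):
    # 2{l(theta_hat) - l(theta)} = 955.2*theta - 270*log(theta) - 611.1446
    return 955.2 * theta - 270.0 * math.log(theta) - 611.1446

def bisect(f, lo, hi, tol=1e-10):
    # Simple bisection, assuming f changes sign exactly once on [lo, hi].
    flo = f(lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) * flo > 0:
            lo, flo = mid, f(mid)
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

theta_hat = 135.0 / 477.6
target = 3.84   # upper 95th percentile of a 1 df chi-squared, as in the text

# One root on each side of theta_hat.
lower = bisect(lambda t: lr_stat(t) - target, 0.15, theta_hat)
upper = bisect(lambda t: lr_stat(t) - target, theta_hat, 0.45)

print(f"{lower:.4f} {upper:.4f}")   # roughly .2376 and .3331
```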

Inference based on the log-likelihood function rests on the intuitive measures indicated in Figure 2.1.


Chapter 3
Ordered Categorical Variables

3.1 Introduction 53
3.2 Log-Linear Models with One Ordered Category 53
3.3 Two Cross-Classified Ordered Categories 59

3.1 Introduction

The programs based on Table 2.1 show that indicator variables can be useful in examining categories in discrete-valued, cross-classified data. This chapter and the following one give examples of analyses that demonstrate some of the possible advantages of creating your own independent variables in regression models for discrete data. The examples of this chapter contrast both the benefits and the shortcomings of relying on the CLASS statement to build your regression variables for you.

This chapter focuses on using ordered categories of data, rather than unordered categories. Some examples of unordered categories include the numbers of people categorized by religion, race, or ethnicity. These categorical variables are good candidates for a CLASS statement in which all the categories can be compared to each other or to a single baseline or reference group. Ordered categories do not lend themselves to this sort of analysis.

The statistical analysis of ordered categorical variables will almost always benefit from making use of their inherent ordering. Two examples of data with ordered categories are given in Tables 3.1 and 3.3. In the first of these examples, coal miners are categorized by their ages, which are grouped into five-year intervals. You can use the natural ordering of the age categories to create models in SAS with meaningful covariates that have a natural interpretation.

The second example, which has two ordered, cross-classified categorical variables, is described in Section 3.3. In this example, which involves trauma outcome and treatment, both margins of Table 3.3 are ordered, and a single-parameter interaction term is fitted that provides an improved fit over the model of independence.

3.2 Log-Linear Models with One Ordered Category

The data in Table 3.1 is given by Ashford and Sowden (1970) and McCullagh and Nelder (1989, p. 230). This table lists the health complaints of United Kingdom coal miners, organized by age. The miners are classified as suffering from either breathlessness, wheezing, both, or neither. All miners in the survey are smokers and have been diagnosed with pneumoconiosis. Miners were excluded if they were retired or were absent at the time of the survey for health reasons. There are some strong biases in this survey data and its epidemiologic value is limited. Additional discussions and analyses of this data are given by Plackett (1981, p. 87) and Mantel and Brown (1973).


TABLE 3.1 Coal miners organized by age who are reporting breathlessness and wheezing. Source: Ashford and Sowden (1970).

                 Breathlessness       No breathlessness        Odds ratio
Age         Wheeze   No wheeze     Wheeze    No wheeze     Empirical    Log
20–24           9          7           95        1841          24.9    3.21
25–29          23          9          105        1654          40.3    3.70
30–34          54         19          177        1863          29.9    3.40
35–39         121         48          257        2357          23.1    3.14
40–44         169         54          273        1778          20.4    3.02
45–49         269         88          324        1712          16.2    2.79
50–54         404        117          245        1324          18.7    2.93
55–59         406        152          225         967          11.5    2.44
60–64         372        106          132         526          13.9    2.63
Totals       1827        600         1833      14,022          23.3    3.15

Used with permission: International Biometric Society.
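The empirical odds ratio columns of Table 3.1 come directly from each age group's 2 × 2 counts. A quick check in Python (an illustration of my own, not part of the book's SAS programs) for the 25–29 row and the Totals row:

```python
import math

def odds_ratio(a, b, c, d):
    """Empirical odds ratio (a*d)/(b*c) for a 2x2 table laid out as in
    Table 3.1: a = breathlessness & wheeze, b = breathlessness only,
    c = wheeze only, d = neither complaint."""
    return (a * d) / (b * c)

# Age 25-29 row of Table 3.1
or_25 = odds_ratio(23, 9, 105, 1654)
print(f"{or_25:.1f} {math.log(or_25):.2f}")      # 40.3 3.70

# Totals row
or_tot = odds_ratio(1827, 600, 1833, 14022)
print(f"{or_tot:.1f} {math.log(or_tot):.2f}")    # 23.3 3.15
```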

Logistic regression is the best tool for a separate analysis of the effects of breathlessness by age or wheezing by age. Such analyses, however, are marginal to the full examination of this data as an ordered set of 2 × 2 tables. The emphasis of this section is to provide a model for the changing odds ratios for these 2 × 2 tables. A more general regression program for fitting models of odds ratios is described in Section 8.4.

There are two interesting features about this data that make it worthy of detailed investigation here. First, the age variable is ordered and listed in five-year categories. Initially, you can use the CLASS statement to describe the age variable.

Second, the sample sizes are relatively large. Statisticians working with such large samples often complain that “everything is significant.” That is to say, every term added to the model provides a large amount of improvement over the model that does not include this term. The conclusion that is often reached is that only a saturated model with one parameter for every categorical cell adequately fits the data. Any model with fewer parameters than this saturated model will have a large goodness of fit chi-squared value, indicating that an adequate explanation of the data has not yet been reached.

This problem arises from the high power of the chi-squared statistic. With large samples, the chi-squared statistic is able to detect very subtle, statistically significant effects in the data, including those whose practical effect is negligible. A detailed discussion of the power of the chi-squared statistic and how its power is related to sample size is given in Chapter 9. The sensitivity of the test leads one to ask about the separate issues of statistical versus practical significance. A statistically significant effect might have little or no practical meaning. This distinction is also made in Chapter 9.

Let nbwa denote the count in the breathlessness (b = 0, 1), wheeze (w = 0, 1), and age (a = 1, . . . , 9) categories. The mean of nbwa is denoted by mbwa. The log-linear model with all pairwise interactions

log mbwa = µ + αb + βw + γa + θbw + λba + φwa (3.1)

has a poor fit to the data (χ² = 26.63; 8 df; p < .001). This is shown in the goodness of fit statistics for this model in Output 3.1.
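The quoted p < .001 can be verified outside SAS. For even degrees of freedom the chi-squared upper tail has a closed form, used in the Python sketch below (the function name is my own; SciPy's `scipy.stats.chi2.sf` would give the same answer):

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) for a chi-squared variate with even df:
    exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!  (closed form, even df only)."""
    assert df % 2 == 0 and df > 0
    half = x / 2.0
    term, total = 1.0, 1.0          # j = 0 term
    for j in range(1, df // 2):
        term *= half / j            # build (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total

p = chi2_sf_even_df(26.63, 8)
print(p < 0.001)    # True, matching the reported p < .001
```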


The following program fits all the pairwise interactions in Equation 3.1 and the linear three-factor interaction in Equation 3.3 to the coal miner data of Table 3.1. Portions of the output appear in Outputs 3.1 and 3.2. The fitted values of Equation 3.3 are given in Table 3.2.

Program 3.1

title1 'Coal miners: breathlessness and wheezing by age';
data;
   /* Read data as a 2x2 table for each age group.
      Produce a data set with one frequency per line. */
   input age a b c d;
   label
      age = '5 year interval'
      br  = 'breathlessness'
      wh  = 'wheeze'
      bwa = 'linear age times br-wh interaction';
   age = age/5 - 3;   /* recode age categories: 1 to 9 */
   /* Output four lines of frequencies for each age group.
      bwa is a log-odds ratio that is linear in age */
   br=1; wh=1; bwa= age; freq=a; output;
   br=1; wh=0; bwa=-age; freq=b; output;
   br=0; wh=1; bwa=-age; freq=c; output;
   br=0; wh=0; bwa= age; freq=d; output;
   drop a b c d;   /* omit the original 2x2 table */
datalines;
20 9 7 95 1841
25 23 9 105 1654
30 54 19 177 1863
35 121 48 257 2357
40 169 54 273 1778
45 269 88 324 1712
50 404 117 245 1324
55 406 152 225 967
60 372 106 132 526
run;

title2 'Log-linear model with all pairwise interactions';
proc genmod;
   /* age as a class variable ignores its ordering */
   class age br wh;
   model freq = br | wh | age @2 / dist = Poisson obstats;
run;

title2 'All pairwise interactions + linear odds ratio';
proc genmod;
   /* bwa models the interaction of br and wh as linear in age */
   class age br wh;
   model freq = br | wh | age @2 bwa / dist = Poisson obstats;
run;


56 Advanced Log-Linear Models Using SAS

Output 3.1 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 8 26.6904 3.3363

Scaled Deviance 8 26.6904 3.3363

Pearson Chi-Square 8 26.6348 3.3293

Scaled Pearson X2 8 26.6348 3.3293

Log Likelihood 108549.0541

There is no benefit to using the ordering of the age variable here because of the lack of overall fit. The following examination of this data concentrates on building three-dimensional interaction effects. These effects describe the relationship between wheezing and breathlessness and how this relationship changes with age.

Instead of using the @2 option as in Program 3.1, the inclusion of a three-factor interaction ηbwa in Equation 3.1 could be fit using the @3 option in the SAS statements

proc genmod;
   class age br wh;
   model freq = br | wh | age @3 / dist = Poisson obstats;
run;

This program, however, fits a saturated model with a perfect fit (that is, χ2 = 0) that uses up all of the available df and is therefore not useful. The problem is to find a log-linear model that lies between Equation 3.1 and this saturated model.

The construction of a better fitting log-linear model is based on how the three-factor interaction ηbwa is constructed and interpreted. In the original data of Table 3.1, there is a large (pairwise) interaction between breathlessness and wheeze. This interaction is not constant, and the empirical odds ratios in Table 3.1 generally decrease with age. The three-factor interaction can be interpreted as a modeling of these differing odds ratios for each age. Odds ratios are illustrated in Section 2.2.2 and are examined again in Chapter 8. As shown above, the use of the AGE variable in a CLASS statement with a three-factor interaction in a MODEL statement creates a set of variables that fit a separate odds ratio for each age. This uses up all of the degrees of freedom and results in a saturated model. This is because the use of AGE as a CLASS variable ignores the natural ordering in its values.
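These empirical odds ratios can be recomputed directly from the counts that Program 3.1 reads. The Python sketch below (an added illustration, not one of the book's programs) lists one odds ratio per age group; the (a, b, c, d) tuples are the columns of the DATALINES.

```python
import math

# Sketch (not from the book): empirical breathlessness-by-wheeze odds
# ratios for the coal miner data, one per age group.
counts = {                    # age: (both, br only, wh only, neither)
    20: (9, 7, 95, 1841),    25: (23, 9, 105, 1654),
    30: (54, 19, 177, 1863), 35: (121, 48, 257, 2357),
    40: (169, 54, 273, 1778), 45: (269, 88, 324, 1712),
    50: (404, 117, 245, 1324), 55: (406, 152, 225, 967),
    60: (372, 106, 132, 526),
}

for age, (a, b, c, d) in counts.items():
    odds_ratio = (a * d) / (b * c)      # empirical odds ratio at this age
    print(age, round(odds_ratio, 2), round(math.log(odds_ratio), 2))
# The odds ratios run from about 24.9 at ages 20-24 down to about 14.0 at
# ages 60-64: a large interaction at every age, generally decreasing.
```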

A useful addition to Equation 3.1 that allows for a change in the odds ratio at each age is to assume that the log-odds ratio is linear in age. How this is done in SAS is demonstrated below. The model with a linear trend in the log-odds appends the three-factor interaction

ξbwa =   ξ age    if b = w
        −ξ age    if b ≠ w          (3.2)

to the terms in Equation 3.1.

This model term is called BWA in Program 3.1 and uses 1 df to estimate the single regression coefficient ξ. The fitted parameter value, its significance level, and its confidence interval are given in Output 3.2. The variable AGE is the recoded age category numbered 1 through 9 instead of the original scale in years.


Output 3.2 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 7 6.8017 0.9717

Scaled Deviance 7 6.8017 0.9717

Pearson Chi-Square 7 6.8082 0.9726

Scaled Pearson X2 7 6.8082 0.9726

Log Likelihood 108558.9984

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter              DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq

age*br 8 0 1 0.4213 0.0959 0.2333 0.6094 19.28 <.0001

age*br 8 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*br 9 0 0 0.0000 0.0000 0.0000 0.0000 . .

age*br 9 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 1 0 1 1.6556 0.1337 1.3935 1.9177 153.24 <.0001

age*wh 1 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 2 0 1 1.4169 0.1287 1.1647 1.6691 121.24 <.0001

age*wh 2 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 3 0 1 1.0224 0.1132 0.8005 1.2444 81.51 <.0001

age*wh 3 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 4 0 1 0.8957 0.1046 0.6906 1.1007 73.32 <.0001

age*wh 4 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 5 0 1 0.5529 0.1025 0.3520 0.7538 29.09 <.0001

age*wh 5 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 6 0 1 0.3639 0.0978 0.1722 0.5556 13.84 0.0002

age*wh 6 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 7 0 1 0.3144 0.0958 0.1267 0.5021 10.78 0.0010

age*wh 7 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 8 0 1 0.2077 0.0947 0.0221 0.3933 4.81 0.0283

age*wh 8 1 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 9 0 0 0.0000 0.0000 0.0000 0.0000 . .

age*wh 9 1 0 0.0000 0.0000 0.0000 0.0000 . .

bwa 1 -0.1306 0.0295 -0.1884 -0.0728 19.62 <.0001

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.


The interpretation of ξbwa in Equation 3.2 is a log-odds ratio of wheezing and breathlessness for each age category. Specifically, if you add the ξbwa term to Equation 3.1, then you have

log {m11a m22a / m12a m21a} = 4ξa

for every age category a = 1, . . . , 9.

The ξ parameter of Equation 3.2 could also be constructed as

ξbwa =   ξ age    if b = w = 1
         0        otherwise.

In this case, the fitted categorical cell estimates are the same as they are in Equation 3.2. Notice the different construction of ξbwa here and at Equation 3.2. Unlike other interaction effects in log-linear models, this ξ parameter has one degree of freedom and does not obey the constraints for identifiability described in Section 2.2.

The log-linear model with a linear trend in the log-odds ratios is

log mbwa = µ + αb + βw + γa + θbw + λba + φwa + ξbwa (3.3)

where ξbwa was defined at Equation 3.2.

This model contains all pairwise interactions and the linear three-factor interaction ξ.

Program 3.1 produces the fitted values given in Table 3.2. Some of the fitted parameter values, including the BWA term, appear in Output 3.2. Equation 3.3 has a good fit to the data: (χ2 = 6.81; 7 df; p = .45).

TABLE 3.2 Fitted means and marginal odds ratio for the coal miner data of Table 3.1 obtained by Program 3.1. The fitted log odds ratios are linear in age. The fit of the model is good: (χ2 = 6.81; 7 df; p = .45).

             Breathlessness           No breathlessness          Odds ratio

Age       Wheeze    No wheeze      Wheeze    No wheeze       Fitted      Log

20–24       10.2        5.8          93.8      1842.2         34.66     3.55
25–29       21.2       10.8         106.8      1652.2         30.41     3.42
30–34       52.5       20.5         178.5      1861.5         26.69     3.28
35–39      121.4       47.6         256.6      2357.4         23.42     3.15
40–44      169.3       53.7         272.7      1778.3         20.56     3.02
45–49      274.8       82.2         318.2      1717.8         18.04     2.89
50–54      393.3      127.7         255.7      1313.3         15.83     2.76
55–59      418.8      139.2         212.2       979.8         13.89     2.63
60–64      365.5      112.5         138.5       519.5         12.19     2.50

Totals 1827 600 1833 14,022

The linear three-factor interaction BWA variable has a negative coefficient of ξ = −.1306 (SE = .0295); p < .0001. The log-odds ratios in Table 3.2 decrease by this amount for each ordered age category. The large Wald value (χ2 = 19.62; 1 df) for bwa in Output 3.2 indicates the very important contribution that this parameter makes to Equation 3.1.

The interpretation of Equation 3.3 is that breathlessness and wheeze are highly related at all ages and that this relationship is strongest at younger ages. Any man who suffers from one symptom is very likely to exhibit the other as well. At older ages this interaction of symptoms continues to be very large, but breathlessness and wheeze appear less strongly related to each other. Therefore, breathlessness and wheezing are closely related symptoms overall, but their appearance is indicative of a different type of lung disorder in older and younger men.


After performing the present analysis, you can make marginal statements following a logistic regression that describes the risk factors for one symptom or the other, marginally. You can then be reasonably assured that any conclusion that you draw about one symptom is equally true about the other. Additional methods for modeling odds ratios in 2 × 2 tables are given in Chapter 8.

3.3 Two Cross-Classified Ordered Categories

The previous example of coal miners contains only one ordered category. The example studied in this section has a two-dimensional cross-classification in which both categorical variables have ordered categories. The interaction of two CLASS variables produces a large set of summary measures that are also explained here. The method of measuring association through the use of the CLASS statement does not have a simple interpretation, however. Two other approaches, each using a single df parameter interaction effect that simplifies this type of association between the two ordered categorical variables, are demonstrated below. These different single df interaction parameters are also used to get past the model of independence of rows and columns, but not so far as the saturated model. One of these models has the effect of fitting a model with an approximately constant odds ratio when all adjacent categories are combined to form a 2 × 2 table. The odds ratio is nearly the same no matter how this process of combining is applied. This interaction term is only meaningful when the row and column categories of a two-dimensional table are both ordered. The other model is motivated by the manner in which the CLASS statement creates models of interactions.

The data of Table 3.3 summarizes the outcome of a clinical trial of a study medication used in patients following severe head trauma due to subarachnoid hemorrhage. This data is given by Agresti and Coull (1998). Both the treatment and outcome categories are ordered.

TABLE 3.3 Outcomes and treatments in a clinical trial for severe head trauma. The count indicated with ‘*’ is unusually small.

Outcome

Treatment                  Vegetative    Major         Minor         Good
group           Death      state         disability    disability    recovery    Total

Placebo          59          25            46            48            32         210
Low dose         48          21            44            47            30         190
Medium dose      44          14            54            64            31         207
High dose        43           4*           49            58            41         195

Total 194 64 193 217 134 802

Reprinted from Computational Statistics & Data Analysis, Vol. 28, A. Agresti and B. A. Coull, “Order restricted inference for monotone trend alternatives in contingency tables,” pp. 139–55, © 1998, with permission from Elsevier.

The drug was administered at three different doses and at a fourth, placebo level. The four ordered treatments are denoted as Placebo, Low, Medium, and High doses. The five categorical patient outcomes are also ordered in a natural level of increasing severity: Good Recovery, Minor Disability, Major Disability, Vegetative State, and Death. The goal of the following analysis is to determine whether or not the drug is effective, and then to clearly demonstrate the magnitude of its effect.

An examination of this data using a CLASS statement for the treatment and outcome categories can be used in the initial test of independence of these two variables. The model


fitted by the interaction of these CLASS variables, however, fits a saturated model that is difficult to interpret. This model of independence is fitted using the SAS statements given in Program 3.2. Output 3.3 contains the goodness of fit statistics for this model. The model of independence fails to explain this data adequately: (χ2 = 25.03; 12 df; p = .015). That is, there appears to be a significant relationship or dependence between treatment type and outcome that this model does not explain.
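As a check outside GENMOD, the Pearson statistic for independence can be recomputed from the raw counts of Table 3.3. The Python sketch below (an added illustration, not one of the book's programs) reproduces the value 25.03 reported in Output 3.3.

```python
# Sketch (not from the book): Pearson chi-squared test of independence for
# the trauma data of Table 3.3.
table = [
    [59, 25, 46, 48, 32],   # Placebo
    [48, 21, 44, 47, 30],   # Low dose
    [44, 14, 54, 64, 31],   # Medium dose
    [43, 4, 49, 58, 41],    # High dose
]
n = sum(map(sum, table))
rows = [sum(r) for r in table]
cols = [sum(c) for c in zip(*table)]
chi2 = sum((o - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i, r in enumerate(table) for j, o in enumerate(r))
print(round(chi2, 2))   # 25.03, on (4 - 1)(5 - 1) = 12 df
```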

The following program fits all of the log-linear models described in this section to the trauma data of Table 3.3. Portions of the output are given in Outputs 3.3 and 3.4. Fitted values are given in Tables 3.5 and 3.6.

Program 3.2

title1 'Trauma and outcome. Data from Agresti and Coull, 1998.';
data;
   input treat outcome freq common @@;
   label
      treat   = 'four treatments'
      outcome = 'five ordered outcomes'
      txo     = 'treatment and outcome interaction'
      common  = 'common class-style odds ratio';
   txo = treat*outcome;    /* linear by linear interaction */
datalines;
1 1 59  1   1 2 25  1   1 3 46  1   1 4 48  1   1 5 32 -4
2 1 48  1   2 2 21  1   2 3 44  1   2 4 47  1   2 5 30 -4
3 1 44  1   3 2 14  1   3 3 54  1   3 4 64  1   3 5 31 -4
4 1 43 -3   4 2  4 -3   4 3 49 -3   4 4 58 -3   4 5 41 12
run;
title2 'Independence of treatment and outcome';
proc genmod;               /* This step produces Output 3.3 */
   class treat outcome;
   model freq = treat outcome / dist = Poisson;
run;
title2 'Interaction of a pair of class variables';
proc genmod;               /* Output 3.4 is produced by this step */
   class treat outcome;
   model freq = treat outcome treat * outcome / dist = Poisson;
run;
title2 'Fit the common odds ratio interaction';
proc genmod;
   /* The fitted values from this step appear in Table 3.5 */
   class treat outcome;
   model freq = treat outcome common / dist = Poisson;
run;
title2 'Fit the treat * outcome linear by linear interaction';
proc genmod;
   /* The fitted values from this step appear in Table 3.6 */
   class treat outcome;
   model freq = treat outcome txo / dist = Poisson;
run;
title2 'Common odds and linear by linear interaction effects';
proc genmod;
   class treat outcome;
   model freq = treat outcome common txo / dist = Poisson;
run;


Output 3.3 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 12 27.7949 2.3162

Scaled Deviance 12 27.7949 2.3162

Pearson Chi-Square 12 25.0324 2.0860

Scaled Pearson X2 12 25.0324 2.0860

Log Likelihood 2214.4542

This interaction between treatment and outcome can be fitted to this data using the interaction of the two CLASS variables in the program:

proc genmod;
   class treat outcome;
   model freq = treat outcome treat*outcome / dist = Poisson;

run;

A portion of the output of this program is given in Output 3.4.


Output 3.4 The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter              DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq

Intercept 1 3.7136 0.1562 3.4075 4.0197 565.42 <.0001

treat 1 1 -0.2478 0.2359 -0.7102 0.2145 1.10 0.2934

treat 2 1 -0.3124 0.2403 -0.7833 0.1585 1.69 0.1935

treat 3 1 -0.2796 0.2380 -0.7461 0.1869 1.38 0.2401

treat 4 0 0.0000 0.0000 0.0000 0.0000 . .

outcome 1 1 0.0476 0.2183 -0.3802 0.4754 0.05 0.8273

outcome 2 1 -2.3273 0.5238 -3.3540 -1.3006 19.74 <.0001

outcome 3 1 0.1782 0.2117 -0.2366 0.5931 0.71 0.3997

outcome 4 1 0.3469 0.2040 -0.0530 0.7468 2.89 0.0891

outcome 5 0 0.0000 0.0000 0.0000 0.0000 . .

treat*outcome 1 1 1 0.5642 0.3096 -0.0426 1.1710 3.32 0.0684

treat*outcome 1 2 1 2.0804 0.5879 0.9281 3.2327 12.52 0.0004

treat*outcome 1 3 1 0.1847 0.3127 -0.4282 0.7976 0.35 0.5549

treat*outcome 1 4 1 0.0586 0.3061 -0.5414 0.6586 0.04 0.8482

treat*outcome 1 5 0 0.0000 0.0000 0.0000 0.0000 . .

treat*outcome 2 1 1 0.4224 0.3191 -0.2030 1.0478 1.75 0.1856

treat*outcome 2 2 1 1.9706 0.5961 0.8023 3.1389 10.93 0.0009

treat*outcome 2 3 1 0.2047 0.3176 -0.4177 0.8272 0.42 0.5191

treat*outcome 2 4 1 0.1021 0.3102 -0.5060 0.7101 0.11 0.7421

treat*outcome 2 5 0 0.0000 0.0000 0.0000 0.0000 . .

treat*outcome 3 1 1 0.3026 0.3204 -0.3253 0.9305 0.89 0.3449

treat*outcome 3 2 1 1.5323 0.6149 0.3272 2.7375 6.21 0.0127

treat*outcome 3 3 1 0.3767 0.3092 -0.2292 0.9827 1.49 0.2230

treat*outcome 3 4 1 0.3780 0.2992 -0.2084 0.9644 1.60 0.2064

treat*outcome 3 5 0 0.0000 0.0000 0.0000 0.0000 . .

treat*outcome 4 1 0 0.0000 0.0000 0.0000 0.0000 . .

treat*outcome 4 2 0 0.0000 0.0000 0.0000 0.0000 . .

treat*outcome 4 3 0 0.0000 0.0000 0.0000 0.0000 . .

treat*outcome 4 4 0 0.0000 0.0000 0.0000 0.0000 . .

treat*outcome 4 5 0 0.0000 0.0000 0.0000 0.0000 . .

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

There are two problems immediately apparent in Output 3.4. First, this table is very difficult to interpret. It is clear again from the initial test of independence that there are important relationships between the various treatment levels and the patients’ clinical outcomes. Several of the interaction terms in Output 3.4 have a high statistical significance. Those that are not zero all have positive estimated values. There is no simple summary that describes the findings from this table. Part of the difficulty is that the CLASS statement ignores the ordering of the row and column categories in this data.

A difficulty with the interpretation of the parameters in Output 3.4 is further revealed in an explanation of the estimated values. Intuitively one would like to see the marginal response of treatment and outcome to be functions of only the marginal totals of the data. Similarly, interaction effects of treatment and outcome should only depend on the ‘interior’ counts of the table. The interior of the table refers to the frequencies of specific treatment and outcome combinations, as opposed to the marginal counts of these interior frequencies. As explained below, both the main and the interaction effects in Output 3.4 are all functions of the interior of the table of counts. Therefore, the roles of these types of parameters are not straightforward.

Second, all of the degrees of freedom have been used up in Output 3.4 and the goodness of fit chi-squared is zero. This says that all of the information in the data is presented in Output 3.4. This is clearly not a useful conclusion, however. The saturated model always has a perfect fit, but this defeats the aim of statistical modeling, which is to find reasonably fitted models with a small number of parameters. So, you need to identify models that describe the interactions between treatment and outcome, but do not use up all of the available df.

In order to justify using other models of the interactions between treatment and outcome for this data, it is also necessary to explain the estimated parameter values given in Output 3.4. Using the CLASS statement makes the last row and column the reference categories. The cell count in the last row and column (41 patients with High Dose and Good Recovery outcome) plays a crucial role in all of the parameter values given in Output 3.4. All of the estimated parameter values involve comparisons with this reference count.

In particular, the intercept is the log (base e) of this count:

Intercept = log 41 = 3.7136.

Under the model of independence as modeled at Equation 2.3, the parameter estimates for the main effects of treatment and outcome are functions of the marginal sums alone. In this model, the interior of the table contains all of the information about the interactions of treatment and outcome. When the CLASS statement is used to build the interaction effects, then the main effects parameters of treatment and outcome lose their role as models of the marginal sums and become part of the model for the interaction or interior of the table, as is demonstrated next.

In Output 3.4, the treatment parameters compare the other Good Recovery counts to the (Good recovery/High dose) reference cell. Specifically,

treat 1 = log(32/41) = −.2478

treat 2 = log(30/41) = −.3124

treat 3 = log(31/41) = −.2796

and the treat 4 parameter value is set to zero. Notice how these parameter values are functions of the interior of the table of counts and not of the marginal frequencies.

Similarly, the outcome parameters compare the High Dose counts to the reference cell. Specifically,

outcome 1 = log(43/41) = .0476

outcome 2 = log(4/41) = −2.3273

outcome 3 = log(49/41) = .1782

outcome 4 = log(58/41) = .3469


and outcome 5 is set to zero. Again, notice that these parameter estimates are functions of counts in the interior of the table and not functions of the marginal totals of the data.

The treat*outcome interaction in Output 3.4 is a log-odds ratio that compares each cell count in Table 3.3 to those in the corresponding last (Good recovery) column and last (High dose) row. Imagine creating a 2 × 2 table made up of any given cell in the table, the corresponding cells in the last row and column, and the reference 41 count in the lower right corner. The values of the interactions are

treat*outcome 1 1 = log[(59 × 41)/(32 × 43)] = .5642

treat*outcome 2 1 = log[(48 × 41)/(30 × 43)] = .4224

treat*outcome 1 2 = log[(25 × 41)/(32 × 4)] = 2.0804

treat*outcome 2 2 = log[(21 × 41)/(30 × 4)] = 1.9706

and so on. These are log-odds ratios for 2 × 2 tables obtained from the counts of the original data of Table 3.3. There are no interaction effects measured for cell counts in the last row and/or column of the table, so these are set to zero in Output 3.4.
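These reference-cell calculations can be reproduced directly from the raw counts. The Python sketch below (an added illustration, not one of the book's programs) recomputes the intercept, one main effect of each type, and one interaction, matching the corresponding estimates in Output 3.4.

```python
import math

# Sketch (not from the book): GENMOD's CLASS coding makes every parameter
# of the saturated model a contrast against the last-row/last-column
# reference cell (count 41).
table = [
    [59, 25, 46, 48, 32],   # Placebo
    [48, 21, 44, 47, 30],   # Low dose
    [44, 14, 54, 64, 31],   # Medium dose
    [43, 4, 49, 58, 41],    # High dose
]
ref = table[3][4]                                  # 41: High dose, Good recovery

intercept = math.log(ref)                          # log 41 = 3.7136
treat1 = math.log(table[0][4] / ref)               # log(32/41) = -.2478
outcome2 = math.log(table[3][1] / ref)             # log(4/41)  = -2.3273
inter_1_2 = math.log(table[0][1] * ref /           # log-odds ratio of the
                     (table[0][4] * table[3][1]))  # 2x2 corner: 2.0804

print(round(intercept, 4), round(treat1, 4),
      round(outcome2, 4), round(inter_1_2, 4))
```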

The main-effect parameters (treat and outcome in this example) contain some information about the interaction of the two variables. Notice the difference here with the model of independence, in which the main effect parameters model only the marginal totals, as in the expos and tumor parameter variables in the fungicide example of Output 2.3.

To get beyond this model of independence and to more simply describe the relationship between treatment and outcome without resorting to a saturated model, you need to utilize the two different models described below.

The first interaction effect described here is based on the interactions built by the CLASS variable. Notice that all of the interaction effects given in Output 3.4 are positive, although only a few are statistically significant. You can fit a model in GENMOD in which all of these interaction effects are equal to a single common value. Program 3.2 fits this effect and fits an interaction effect given in Table 3.4. Every row and every column in this interaction effect sums to zero. The estimated means for the model containing this effect are given in Table 3.5.

TABLE 3.4 The COMMON interaction effect that fits a constant log-odds ratio with the last row and column of a 4 by 5 table. This model is suggested by the way in which the CLASS statement in GENMOD builds interaction effects.

 1    1    1    1   −4
 1    1    1    1   −4
 1    1    1    1   −4
−3   −3   −3   −3   12

Program 3.2 fits a COMMON interaction effect. The common interaction effect is detailed in Table 3.4 and fits a model in which all CLASS-style interactions are set equal to a single common value. This is illustrated using the fitted values for this program given in Table 3.5. For example, the fitted value for the (1,1) cell of Table 3.5 has a CLASS-style interaction

log[(51.64 × 41)/(44.72 × 32.17)] = .39,

the (1,2) cell has a CLASS-style interaction of

log[(17.04 × 41)/(32.17 × 14.75)] = .39,

the (2,1) cell has a CLASS-style interaction of

log[(46.73 × 41)/(29.11 × 44.72)] = .39,

and so on.
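The constancy of these CLASS-style interactions can be verified numerically from the fitted means of Table 3.5. The Python sketch below (an added illustration, not one of the book's programs) computes the log-odds ratio against the last row and column for every interior cell.

```python
import math

# Sketch (not from the book): the fitted means under the COMMON model give
# the same CLASS-style log-odds ratio (about .39) for every interior cell.
fitted = [
    [51.64, 17.04, 51.38, 57.77, 32.17],   # Placebo
    [46.73, 15.41, 46.48, 52.27, 29.11],   # Low dose
    [50.91, 16.79, 50.64, 56.94, 31.71],   # Medium dose
    [44.72, 14.75, 44.49, 50.03, 41.00],   # High dose
]
ref = fitted[3][4]

ratios = [
    math.log(fitted[i][j] * ref / (fitted[i][4] * fitted[3][j]))
    for i in range(3) for j in range(4)    # every interior cell
]
print([round(r, 2) for r in ratios])       # every entry is 0.39
```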


TABLE 3.5 Expected counts for the trauma data of Table 3.3 fitting a constant CLASS-style interaction effect. These values are provided by Program 3.2. The fit is not very good: (χ2 = 21.88; 11 df; p = .03).

Outcome

Treatment                  Vegetative    Major         Minor         Good
group           Death      state         disability    disability    recovery    Total

Placebo         51.64       17.04         51.38         57.77         32.17       210
Low dose        46.73       15.41         46.48         52.27         29.11       190
Med dose        50.91       16.79         50.64         56.94         31.71       207
High dose       44.72       14.75         44.49         50.03         41          195

Total 194 64 193 217 134 802

The marginal sums of Table 3.5 are preserved and are the same as the marginal sums of the original data in Table 3.3. The estimated COMMON effect for the expected counts in Table 3.5 is .0193 with an estimated standard error of .0104. This parameter provides a modest contribution (Wald χ2 = 3.42; 1 df; p = .06) to the model of independence of treatment and outcome.

The chi-squared goodness of fit test for this model is (χ2 = 21.88; 11 df; p = .03), indicating a poor fit. The single outlying observed count of 4 in the High dose/Vegetative state has an expected count of 14.75 and a Pearson residual of −2.80. This data point makes a large contribution to the lack of fit. Over one third of the value of the chi-squared statistic (= 21.88) can be traced to this single count (2.80² = 7.83).
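The residual calculation is elementary and can be checked in a line or two of Python (an added illustration, not one of the book's programs):

```python
import math

# Sketch (not from the book): Pearson residual for the outlying High dose /
# Vegetative state count of 4, whose fitted mean under the COMMON model is
# 14.75 (Table 3.5).
observed, expected = 4, 14.75
residual = (observed - expected) / math.sqrt(expected)
print(f"{residual:.2f}")         # -2.80
print(f"{residual ** 2:.2f}")    # 7.83 of the chi-squared total 21.88
```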

Another 1 df interaction effect is necessary to get past the model of independence but not so far as a saturated model. This model has a slightly better fit than the model with a constant CLASS-style interaction. It takes the multiplicative form described below. A property of this additional parameter is that an approximately constant odds ratio is fitted to the table. Unlike the COMMON odds ratio parameter fitted above, this constant odds ratio appears when any combination of adjacent categories is combined or collapsed to form a 2 × 2 table.

Such models with a constant collapsed odds ratio were first proposed by Plackett (1965). The Plackett model is not log-linear, so you can only approximately fit models with this property. Nevertheless, the approximation is fairly accurate. The property of a constant odds ratio in the fitted values in Table 3.6 is demonstrated below.

TABLE 3.6 Expected counts for the trauma data of Table 3.3 using Equation 3.4. Table 3.9 demonstrates that these values exhibit an almost constant log-odds ratio when adjacent categories are summed to form 2 × 2 tables. The modest fit (χ2 = 18.11; 11 df; p = .08) can be traced to the unusual count identified in Table 3.3.

Outcome

Treatment                  Vegetative    Major         Minor         Good
group           Death      state         disability    disability    recovery    Total

Placebo         61.71       18.52         50.53         51.07         28.17       210
Low dose        48.86       15.73         46.02         49.88         29.51       190
Medium dose     46.13       15.93         49.97         58.10         36.87       207
High dose       37.30       13.82         46.48         57.95         39.44       195

Total 194 64 193 217 134 802

The log-linear model that approximates the constant odds ratio is given by

log mij = µ + αi + βj + γ hij (3.4)


where

hij = i × j. (3.5)

The hij interaction effect is just the product of the row and column number for these ordered categories. This interaction is known as the linear by linear interaction in the study of linear regression models. It is also useful in the present setting, however. The h effect cannot be constructed in the GENMOD MODEL statement as an interaction of two CLASS variables because the ‘*’ symbol has a different meaning. This interaction is constructed in the DATA step of Program 3.2 and is given the variable name TXO.

The interpretation of the multiplicative hij effect is that counts in the upper left and lower right are greater than what the model of independence would estimate when γ is positive. In terms of the original data, this means that Placebo/Low dose treatments are associated with poor outcomes and higher drug doses are associated with more favorable outcomes. These combinations of categories have greater frequencies than those expected by the model of independence. Similarly, negative values of γ fit smaller expected counts to cells with high dose/poor outcomes and low dose/good outcomes. The hij parameter is denoted by the variable TXO in Program 3.2 to signify the product of treatment and outcome categories. Section 4.3 contains another example of the use of the linear by linear interaction.

The model of independence is obtained in Equation 3.4 by setting γ = 0, so this model differs from independence by only 1 df. The CLASS statement in Program 3.2 serves to model the marginal sums of the treatment and outcome variables.

The estimated value of γ in Equation 3.4 is .070 with an estimated standard deviation of .023. The approximate statistical significance of γ and the hij interaction effect is large (Wald χ2 = 9.56; 1 df; p = .002), which indicates the considerable contribution this 1 df parameter makes to the model.
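A consequence of Equation 3.4 is that every "local" 2 × 2 subtable of adjacent rows and columns has log-odds ratio γ, since the linear by linear terms γij + γ(i+1)(j+1) − γi(j+1) − γ(i+1)j reduce to γ. The Python sketch below (an added illustration, not one of the book's programs) checks this against the rounded fitted means of Table 3.6.

```python
import math

# Sketch (not from the book): under the linear by linear model of Equation
# 3.4, every adjacent 2x2 subtable has log-odds ratio gamma. The fitted
# means of Table 3.6 (rounded) reproduce gamma = .070.
fitted = [
    [61.71, 18.52, 50.53, 51.07, 28.17],   # Placebo
    [48.86, 15.73, 46.02, 49.88, 29.51],   # Low dose
    [46.13, 15.93, 49.97, 58.10, 36.87],   # Medium dose
    [37.30, 13.82, 46.48, 57.95, 39.44],   # High dose
]
local = [
    math.log(fitted[i][j] * fitted[i + 1][j + 1] /
             (fitted[i][j + 1] * fitted[i + 1][j]))
    for i in range(3) for j in range(4)    # every adjacent 2x2 subtable
]
print([round(v, 2) for v in local])        # every entry is 0.07
```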

The log-linear model in Equation 3.4 has an acceptable fit to the data: (χ2 = 15.76; 11 df; p = .15). The principal problem with the fit can again be traced to the group of 4 patients treated at the high dose who remained in a vegetative state. (This data point was also identified as an outlier in the discussion of fitted values in Table 3.5.) The fitted values in Table 3.6 estimate that 13.82 patients should be expected in this category. This one data point has a Pearson residual of

(4 − 13.82)/√13.82 = −2.64

and contributes 2.64² = 6.97, or more than one third of the value of the chi-squared statistic. No other data point fits this badly. The second largest Pearson residual of the fitted values in Table 3.6 has a value of 1.50.

A useful feature of Equation 3.4 is that an approximately constant odds ratio is fitted to the data when adjacent ordered categories are combined to form a 2 × 2 table. Examples of the approximately constant odds ratios are demonstrated in Tables 3.7 and 3.8. Table 3.7 combines adjacent categories in Table 3.6 to produce a 2 × 2 table. All levels of the drug treatments and all the alive outcomes are summed. The odds ratio in this table is 1.44. Table 3.8 also collapses adjacent categories in Table 3.6 into a 2 × 2 table. The odds ratio in Table 3.8 is 1.41.

TABLE 3.7 The fitted expected counts of Equation 3.4. This 2 × 2 table is obtained by summing over the expected counts in adjacent categories of Table 3.6. The odds ratio in this table is 1.44.

                        Dead      All alive categories

Placebo                61.71      148.29
All drug categories   132.29      457.71


TABLE 3.8 Equation 3.4 demonstrated in another 2 × 2 collapsed table from the expected counts in Table 3.6. The odds ratio in this table is 1.41.

                       Less favorable categories     Good recovery

Placebo                181.83                         28.17
All drug categories    486.17                        105.83

The odds ratios corresponding to all possible sums over adjacent ordered categories of the fitted values in Table 3.6 are given in Table 3.9. These odds ratios all fall into the narrow range of 1.37 to 1.47.

TABLE 3.9 Odds ratios of all possible 2 × 2 tables obtained by combining all possible adjacent categories in the fitted values of Table 3.6.

Outcome

Treatment                  Vegetative    Major         Minor         Good
group           Death      state         disability    disability    recovery

Placebo
           1.45       1.44       1.40       1.41
Low dose
           1.46       1.45       1.40       1.39
Medium dose
           1.47       1.46       1.39       1.37
High dose

Each value is the odds ratio of the 2 × 2 table formed by collapsing between the adjacent rows and the adjacent columns surrounding it.
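The collapsing property can be verified numerically. The Python sketch below (an added illustration, not one of the book's programs) forms all twelve 2 × 2 tables obtained by cutting Table 3.6 between adjacent rows and between adjacent columns, and computes each odds ratio; the results span only the narrow range of 1.37 to 1.47.

```python
# Sketch (not from the book): collapse the fitted means of Table 3.6 at
# every pair of row and column cutpoints and compute the odds ratio of
# each resulting 2x2 table, as summarized in Table 3.9.
fitted = [
    [61.71, 18.52, 50.53, 51.07, 28.17],   # Placebo
    [48.86, 15.73, 46.02, 49.88, 29.51],   # Low dose
    [46.13, 15.93, 49.97, 58.10, 36.87],   # Medium dose
    [37.30, 13.82, 46.48, 57.95, 39.44],   # High dose
]

def block(rows, cols):
    """Sum the fitted means over a block of rows and columns."""
    return sum(fitted[i][j] for i in rows for j in cols)

odds = []
for r in range(1, 4):                 # cut after row r
    for c in range(1, 5):             # cut after column c
        a = block(range(r), range(c))
        b = block(range(r), range(c, 5))
        cc = block(range(r, 4), range(c))
        d = block(range(r, 4), range(c, 5))
        odds.append(a * d / (b * cc))

print(round(min(odds), 2), round(max(odds), 2))   # 1.37 1.47
```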

You can also fit a model that includes both the common class-style and the hij interaction effects. This model exhibits a good fit (χ2 = 15.68; 10 df; p = .11) that is about as good as the model including the hij interaction alone. In the log-linear model that includes both interactions, only the h effect is statistically significant (p = .01). The COMMON CLASS-style interaction does not make an appreciable additional contribution (p = .7) to the model. All of the log-linear models fitted in this section are summarized in Table 3.10.

TABLE 3.10 Summary of fitted log-linear models from Program 3.2 for the head trauma data of Table 3.3.

                      Overall                        Parameter
                  goodness of fit        Common odds          hij (txo)

Model             df     χ2      p       Wald      p        Wald      p

Independence      12    25.03   .01
Common odds       11    21.88   .03      3.42    .064
hij (txo)         11    15.76   .15                         9.56    .002
Common + hij      10    15.68   .11       .14    .711       6.51    .011

The conclusion of this example is that the drug is effective and that higher doses are associated with more favorable outcomes. The two single df parameters added to the model of independence provide a simple way of demonstrating the relationship between treatment levels and outcome categories. A single outlying count appears in several models for this data. Other useful models for data with ordered categories such as in this example are described by Agresti and Coull (1996, 1998).


Chapter 4
Non-Rectangular Tables

4.1 Introduction 69
4.2 Independence in a Triangular Table 69
4.3 Interactions in a Circular Table 72
4.4 Bradley-Terry Model for Pairwise Comparisons 79

4.1 Introduction

One traditionally thinks of cross-classified categorical data as appearing in a rectangular format. The data in Table 3.1 in Section 3.2 is an example. This type of rectangular data is also called factorial because every level of every discrete-valued variable is cross-classified with all levels of all other variables.

This chapter contains three examples of data that is not rectangular. Models are proposed for these data sets and they are analyzed with log-linear models that are programmed in GENMOD. Non-rectangular tables result from structural zeros, or categorical cells where no observations are possible. The distinction should be made with observed zeros where no counts have been observed, but might be observed in a larger sample.

To put this distinction another way, consider a Poisson random variable X with mean λ. If λ > 0 then the event X = 0 is an observed zero that can occur with positive probability e^−λ. On the other hand, if λ = 0 then a structural zero occurs because X = 0 is the only outcome possible and this occurs with probability 1.
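The distinction can be sketched numerically (a Python illustration; the mean 2.5 is an arbitrary choice, not from the data):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable X with mean lam."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0     # structural zero: X = 0 is the only possible outcome
    return math.exp(-lam) * lam ** k / math.factorial(k)

print(poisson_pmf(0, 2.5))   # observed zero: occurs with positive probability exp(-2.5)
print(poisson_pmf(0, 0.0))   # structural zero: occurs with probability 1
```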

In the first example of this chapter, the model of independence holds for a triangular-shaped table but some care must be exercised in order to interpret the fitted model. The second example describes the locations of brain damage in stroke patients. The model of independence for this circular-shaped set of frequencies fails to explain the data well. Alternative models to independence are developed. They are chosen in such a way as to aid in the interpretation of the interactions in the data. The Bradley-Terry model is described in Section 4.4. This model is useful for pairwise comparisons resulting in square tables with omitted diagonals.

A different type of non-rectangular table is described in Chapter 6. Mark-recapture surveys take the form of a rectangular, factorial table with one unobservable cell. The aim of these surveys and their statistical analysis is centered around the estimation of the mean of this single missing cell. This chapter does not attempt to estimate the values in the structural zero categorical cells and does not assume that the underlying table has any shape other than that presented by the data.

4.2 Independence in a Triangular Table

Bishop and Feinberg (1969) present the data in Table 4.1, which summarizes the experiences of 121 stroke patients who were evaluated before and after treatment in a hospital.


The evaluation at admission and again at discharge classifies every patient as belonging to one of five ordered categories of disability. Patients were not discharged from the hospital (except by death) if their condition was worse than upon admission. That is, the data is collected in such a manner that the treatment could not be shown to have an adverse effect. This ordering creates a triangular-shaped set of frequencies.

TABLE 4.1 Admission and discharge status of stroke patients (A = least disability, E = greatest degree of disability). Patients were not discharged if their status deteriorated. Source: Bishop and Feinberg (1969).

                          Discharge status
                  A     B     C     D     E    Totals
            E    11    23    12    15     8      69
Admission   D     9    10     4     1     —      24
status      C     6     4     4     —     —      14
            B     4     5     —     —     —       9
            A     5     —     —     —     —       5
Totals           35    42    20    16     8     121

Used with permission: International Biometric Society.

Program 4.1 fits the log-linear model of independence of admission and discharge status. Parameter estimates and goodness-of-fit measures appear in Output 4.1. The expected counts under the model of independence of rows and columns are given in Table 4.2. It is possible to calculate the expected counts by hand, along with the value of the chi-squared statistic, but these are best done using SAS.

Program 4.1

title1 'Triangular stroke data';
data;
   input count row $ col $;
   label
      count = 'number of patients'
      row   = 'admission status'
      col   = 'discharge status'
   ;
   datalines;
11 E A
23 E B
12 E C
15 E D
8 E E
9 D A
10 D B
4 D C
1 D D
6 C A
4 C B
4 C C
4 B A
5 B B
5 A A
run;

proc print;
run;

proc genmod;
   class row col;
   model count = row col      /* independence of rows and columns */
         / dist = Poisson obstats;
run;

Output 4.1

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 6 9.5958 1.5993

Scaled Deviance 6 9.5958 1.5993

Pearson Chi-Square 6 8.3691 1.3948

Scaled Pearson X2 6 8.3691 1.3948

Log Likelihood 151.5970

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 1 2.0794 0.3536 1.3865 2.7724 34.59 <.0001

row A 1 -1.1417 0.4923 -2.1067 -0.1768 5.38 0.0204

row B 1 -1.4294 0.3661 -2.1470 -0.7118 15.24 <.0001

row C 1 -1.2633 0.3009 -1.8531 -0.6735 17.62 <.0001

row D 1 -0.9328 0.2410 -1.4051 -0.4606 14.99 0.0001

row E 0 0.0000 0.0000 0.0000 0.0000 . .

col A 1 0.6717 0.4091 -0.1302 1.4736 2.70 0.1006

col B 1 1.0082 0.3973 0.2294 1.7869 6.44 0.0112

col C 1 0.3998 0.4267 -0.4365 1.2361 0.88 0.3488

col D 1 0.3614 0.4383 -0.4977 1.2205 0.68 0.4097

col E 0 0.0000 0.0000 0.0000 0.0000 . .

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.


TABLE 4.2 Expected counts for the model of independence of admission and discharge status of stroke patients. The original data is given in Table 4.1, and Program 4.1 obtains these expected counts. The model fit is good (χ2 = 8.37; 6 df; p = .21), and is given in Output 4.1.

                          Discharge status
                  A      B      C      D      E    Totals
            E   15.66  21.92  11.93  11.48   8.00    69
Admission   D    6.16   8.63   4.69   4.52    —      24
status      C    4.43   6.20   3.37    —      —      14
            B    3.75   5.25    —      —      —       9
            A    5.00    —      —      —      —       5
Totals          35     42     20     16      8      121

It is not difficult to calculate the df for this model. The model of independence in a complete 5 × 5 table would have 16 df but Table 4.1 has 10 observations omitted by the sampling design. This results in 16 − 10 = 6 df.

The model of independence has a good fit to the data of Table 4.1 (χ2 = 8.37; 6 df; p = .21). You can conclude, then, that discharge status is independent of admission status except for the ordering of the categories. In terms of the therapy offered to these patients, this study was designed so that the treatment had to be shown as effective. The log-linear model of independence shows that the discharge status cannot be predicted on the basis of the admission status alone. At the same time, the degree of improvement cannot be anticipated knowing only the degree of disability upon admission.
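The same fit can be reproduced without SAS. The sketch below (pure Python, offered as an alternative route to the GENMOD fit) obtains the expected counts for the triangular table by iterative proportional fitting, which converges to the same maximum-likelihood values reported in Table 4.2:

```python
# Observed counts from Table 4.1: (admission, discharge) -> count.
# The ten structural zeros are simply omitted from the dictionary.
counts = {
    ('E','A'): 11, ('E','B'): 23, ('E','C'): 12, ('E','D'): 15, ('E','E'): 8,
    ('D','A'):  9, ('D','B'): 10, ('D','C'):  4, ('D','D'):  1,
    ('C','A'):  6, ('C','B'):  4, ('C','C'):  4,
    ('B','A'):  4, ('B','B'):  5,
    ('A','A'):  5,
}
rows, cols = "EDCBA", "ABCDE"
row_tot = {r: sum(v for (i, _), v in counts.items() if i == r) for r in rows}
col_tot = {c: sum(v for (_, j), v in counts.items() if j == c) for c in cols}

m = {cell: 1.0 for cell in counts}          # start at 1 in every admissible cell
for _ in range(100):                        # iterative proportional fitting
    for r in rows:                          # scale each row to its observed margin
        s = sum(v for (i, _), v in m.items() if i == r)
        for cell in m:
            if cell[0] == r:
                m[cell] *= row_tot[r] / s
    for c in cols:                          # scale each column to its observed margin
        s = sum(v for (_, j), v in m.items() if j == c)
        for cell in m:
            if cell[1] == c:
                m[cell] *= col_tot[c] / s

chi2 = sum((counts[c] - m[c]) ** 2 / m[c] for c in counts)
print(round(m['E', 'A'], 2), round(chi2, 2))   # roughly 15.66 and 8.37 (cf. Table 4.2 and Output 4.1)
```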

4.3 Interactions in a Circular Table

Another study involving stroke victims describes locations in patients' brains where lesions have been recorded. Lesions are likely to occur at the same brain locations in future patients as well. The shape of a cross-section scan of the head results in a roughly circular shape. A number of interaction effects fitted to this data have a useful interpretation in terms of the physiology of strokes.

Albert and McShane (1995) describe the frequency and location of brain lesions in 193 stroke patients. The location of each patient's lesions was determined by a computer-aided tomography (CAT) scan. A CAT scan is a continuous image but the data reported by Albert and McShane divides the brain into ten discrete categorical regions left to right (excluding the midline) and eleven categories front to back. The frequencies of brain lesions in each of these categories are given in Table 4.3.


TABLE 4.3 Location and frequencies of brain lesions in 193 stroke patients. Source: Albert and McShane (1995).

                              Anterior
          Left hemisphere                    Right hemisphere

                      0     1     0     0
                1     1     1     0     0     4
          3     1     0     1     0     2     9    11
    1     4     3     0     1     0     3    11    15    11
    2     2     3     3     0     1     0     1     8    13
   13     7     2     5     6     3     1     0     9    12
   15     9     4     —     4     0     —     8     7     9
    6     8     5     1     0     1     1     8     8     7
          4     4     2     1     2     5     7     7
          6     4     2     1     2     5     8     7
                3     1     1     2     5     5
                              Posterior

Used with permission: International Biometric Society.

The two central regions indicated with '—' marks identify ventricles where lesions cannot occur, and these data cells should be ignored. These structural zero count cells are treated the same way as they were in Table 4.1 of the previous section.

Begin the modeling of this data by fitting the usual log-linear model of independence of rows and columns for the mean m_ij of the count n_ij in the (i, j) cell

log m_ij = μ + α_i + β_j

for all pairs of indices (i, j) restricted to the locations of valid counts in Table 4.3. The code to fit this model is given in Program 4.2.

This program fits the log-linear model with the three interaction effects to the data in Table 4.3. The chi-squared Reschi residuals are plotted against jittered rows, producing the plot in Figure 4.1.

Program 4.2

data stroke;
   input count row col @@;
   h  = (col-5.5)*(row-6);                  /* h asymmetry */
   b  = 1/(1+(col-5.5)**2+(row-6)**2);      /* center bump */
   ob = 1/(1+(col-4.5)**2+(row-6)**2);      /* offset bump */
   e  = sqrt((col-5.5)**2+(row-6)**2);      /* distance from center */
   rowc = row;                              /* row category */
   colc = col;                              /* column category */
   jr = row+normal(0)/10;                   /* jitter row */
   label
      count = 'number of patients'
      colc  = 'left to right category'
      rowc  = 'front to back category'
      e     = 'distance from center'
      h     = 'h interaction'
      b     = 'bump in the center'
      ob    = 'offset bump'
      jr    = 'jittered row'
   ;
   datalines;
0 1 4   1 1 5   0 1 6   0 1 7   1 2 3   1 2 4   1 2 5
. . . more data . . .
1 10 5   2 10 6   5 10 7   8 10 8   7 10 9   3 11 3   1 11 4
1 11 5   2 11 6   5 11 7   5 11 8
run;

proc genmod;            /* model of independence for rows and columns */
   class rowc colc;
   model count = rowc colc / dist=Poisson;
run;

/* Fit model for indep, h, e, ob and obtain residuals */
proc genmod;
   class rowc colc;
   model count = h e ob rowc colc / dist=Poisson obstats;
   ods output obstats=fitted;    /* capture residuals */
run;

data origin;            /* drop class variables */
   set stroke;
   drop rowc colc;
run;

data both;              /* merge observed and fitted values */
   merge origin fitted;
run;

proc gplot;             /* plot chi-squared residuals by jittered row */
   plot reschi * jr / vref=0 haxis=axis1 vaxis=axis2;
run;

The values of α_i and β_j are subject to identifiability constraints as described in Section 2.2. There are 88 independent Poisson-distributed counts in Table 4.3. The model of independence of rows and columns estimates 20 parameters: one intercept μ; 10 α's for the 11 rows; and 9 β's for the 10 columns. There are then 88 − 20 = 68 df associated with the model of independence in this table.

This model of independence does not fit well (χ2 = 101.26; 68 df; p = .006). The interaction of these CLASS variables, as described in Section 3.3, is not particularly useful for this data because there is no clear choice for a reference cell, row, or column. As demonstrated in Output 3.4, the interaction of CLASS variables uses up all of the df, which fits a saturated model with parameters that are sometimes difficult to interpret. A better approach is to develop a well-fitting model that has a small number of easily interpreted parameters.

The following three interaction effects improve upon the independence model. The first of these is the h_ij interaction illustrated in Table 4.4. This multiplicative interaction is programmed in a similar manner to the way it is used at Equation 3.5 in Section 3.3. In the present data set, define

h_ij = (i − 6)(j − 5.5)     (4.1)

where the values of 6 and 5.5 are the respective median row and column index. In the language of linear models and design of experiments, h_ij is the linear-by-linear interaction, centered on each of the two axes.


TABLE 4.4 The h_ij linear-by-linear interaction effect defined at Equation 4.1 for the circular stroke data. This interaction models asymmetry in the left/right and anterior/posterior combined regions.

                                  Anterior
            Left hemisphere                      Right hemisphere

                          7.5    2.5   −2.5   −7.5
                  10.0    6.0    2.0   −2.0   −6.0  −10.0
           10.5    7.5    4.5    1.5   −1.5   −4.5   −7.5  −10.5
    9.0    7.0    5.0    3.0    1.0   −1.0   −3.0   −5.0   −7.0   −9.0
    4.5    3.5    2.5    1.5    0.5   −0.5   −1.5   −2.5   −3.5   −4.5
    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
   −4.5   −3.5   −2.5     —    −0.5    0.5     —     2.5    3.5    4.5
   −9.0   −7.0   −5.0   −3.0   −1.0    1.0    3.0    5.0    7.0    9.0
          −10.5   −7.5   −4.5   −1.5    1.5    4.5    7.5   10.5
          −14.0  −10.0   −6.0   −2.0    2.0    6.0   10.0   14.0
                 −12.5   −7.5   −2.5    2.5    7.5   12.5
                                  Posterior

Centering the h interaction in Equation 4.1 makes it easier to visualize its effect. The greatest effect of the h interaction occurs further from the center of the table. This centering is in contrast to defining h as the product of i and j as at Equation 3.5 in Section 3.3. The interpretation of h in this data is to model asymmetry in the front-to-back and left-to-right categories. The fitted means m are the same whether we center h or not.

When the h product-interaction parameter given at Equation 4.1 is added to the model of independence, the fit is improved (χ2 = 91.37; 67 df; p = .0256), but it is still far from adequate. The estimated regression coefficient of h in this model is −.028 (SE = .0086; Wald χ2 = 10.86; p = .001). This interaction is statistically significant and the single df parameter makes a large contribution to the model and the overall fit. The negative coefficient indicates that lesions are rare in the left/anterior and right/posterior regions relative to the model of independence. Some other models for this data (summarized in Table 4.6) have the reversed sign of the estimated coefficient of h. The interpretation of the coefficient of h is still the same: to model asymmetry of circular regions centered in the middle of the head. The larger the distance from the center of the head, the greater the effect of the h interaction.

Even though h is statistically significant in modeling asymmetry of the model just described, the addition of the next two parameters models symmetric, circular patterns.

The second interaction effect useful in modeling this data is a measure of the distance from the center of the brain. This measure is called e for the Euclidean distance and is defined as

e = √((i − 6)² + (j − 5.5)²).

In the fitted models summarized in Table 4.6 you can see that the coefficient for e is positive, indicating that locations further from the center of the brain are at greater risk for stroke lesions than those at the center. These positive estimated coefficients for the e term appear in models that also include the row and column effects of the model of independence.

A third interaction effect used in the modeling of this data serves to fit the large positive residuals in the center of the circular table after fitting a log-linear model with the independence, h, and e terms. To fit these positive residuals at the center, introduce a regression effect b (as in 'bump'). The b effect is defined as

b = 1/{1 + (i − 6)² + (j − 5.5)²}

The interaction produces a gentle rise in the center of the stroke frequency data. The values for the b interaction are given in Table 4.5. The two central categorical cells have values of .8 and the values of b fall off for categories further away from the center.

TABLE 4.5 The bump effect b in the stroke data. The largest values of this effect appear in the center and fall off towards the edges.

                              Anterior
         Left hemisphere                  Right hemisphere

                    .035  .038  .038  .035
              .043  .052  .058  .058  .052  .043
        .045  .062  .082  .098  .098  .082  .062  .045
  .040  .058  .089  .138  .190  .190  .138  .089  .058  .040
  .045  .070  .121  .235  .444  .444  .235  .121  .070  .045
  .047  .075  .138  .308  .800  .800  .308  .138  .075  .047
  .045  .070  .121   —    .444  .444   —    .121  .070  .045
  .040  .058  .089  .138  .190  .190  .138  .089  .058  .040
        .045  .062  .082  .098  .098  .082  .062  .045
        .034  .043  .052  .058  .058  .052  .043  .034
              .031  .035  .038  .038  .035  .031
                              Posterior

A slightly off-centered bump effect is even more useful in the data analysis. The off-centered bump (ob) is defined by

ob = 1/{1 + (i − 6)² + (j − 4.5)²}

This effect models a rise in frequencies whose peak is centered one column category to the left of center. The values of ob are almost identical to those of b given in Table 4.5 except that the peak is moved one column category to the left. The roles of the b and ob terms are almost the same and there is little benefit in having them both appear in the same log-linear model. The interpretation of having both b and ob in the model is to approximate where the actual central peak frequencies are located. Estimation of the exact location of the peak parameters is difficult because these two parameters do not correspond to a log-linear model. That is, the row and column categories of a peak are not linear in the log of the Poisson means.
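The four regression effects can be spot-checked against the printed tables with a few lines of Python (a sketch mirroring the DATA step in Program 4.2, with i the row index and j the column index):

```python
def h(i, j):    # linear-by-linear asymmetry, Equation 4.1
    return (i - 6) * (j - 5.5)

def e(i, j):    # Euclidean distance from the center of the table
    return ((i - 6) ** 2 + (j - 5.5) ** 2) ** 0.5

def b(i, j):    # bump peaking at the center of the table
    return 1 / (1 + (i - 6) ** 2 + (j - 5.5) ** 2)

def ob(i, j):   # off-center bump: peak moved one column to the left
    return 1 / (1 + (i - 6) ** 2 + (j - 4.5) ** 2)

print(h(1, 4), h(11, 3))   # corner values of Table 4.4: 7.5 and -12.5
print(b(6, 5), b(6, 6))    # the two central cells of Table 4.5: 0.8 each
```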

You can fit all possible combinations of log-linear models involving the h, e, b, and ob effects to the data in Table 4.3. Program 4.2 fits two of these models. The goodness of fit for a wide variety of the possible models is summarized in Table 4.6.

Every appearance of a model containing the ob term in Table 4.6 is accompanied by two sets of df. The exact location of the peak of the ob was not estimated but rather judged by eye. If the row and column corresponding to the peak were estimated then these two additional parameters would have resulted in a loss of 2 df. It could be argued, then, that substituting the ob term for the b term in a log-linear model should result in a loss of 2 df. In either case, the p-value is only slightly changed for every model in Table 4.6 that contains the ob, whether or not you include this 2 df charge for identifying the two location parameters for the effect. Log-linear models that contain both the b and the ob terms have estimated regression coefficients with opposite signs but the ob term is always larger and dominates.


The models summarized in Table 4.6 give estimated coefficients with the corresponding standard errors in parentheses. Every model in this table also contains the row and column parameters of the model of independence. The model containing parameters for independence, e, and ob has a reasonable fit (p = .13 or .18), as does the model containing the independence, h, and ob terms (p = .11 or .15).

TABLE 4.6 Goodness-of-fit statistics and fitted parameter values (with SE) for various log-linear models of the circular stroke data of Table 4.3. All models contain the parameters for the model of independence of row and column categories. Every model containing the ob term is listed with two sets of df, as described in the text.

                          Parameter estimate (SE)
       h            e            b           ob           χ2       df    p-value

Independence                                             101.26    68    .006

One interaction parameter models
   −.028(.009)                                            91.37    67    .026
                .578(.165)                                89.70    67    .034
                             1.68(.75)                    95.41    67    .013
                                          2.35(.66)       88.95    67    .038
                                                                   65    .026

Two interaction parameter models
    .169(.076)  3.76(1.46)                                86.58    66    .046
   −.028(.008)               1.69(.75)                    85.80    66    .051
   −.031(.009)                            2.57(.65)       77.93    66    .149
                                                                   64    .113
                .572(.162)   1.70(.75)                    84.05    66    .066
                .620(.165)                2.53(.65)       76.62    66    .175
                                                                   64    .134
                            −.34(1.1)     2.55(.93)       88.98    66    .031
                                                                   64    .021

Three interaction parameter models
    .178(.076)  3.94(1.46)   1.81(.74)                    80.36    65    .095
   −.03(.009)               −.72(1.1)     3.01(.94)       77.87    65    .132
                                                                   63    .098
                .63(.17)    −.61(1.1)     2.91(.94)       76.61    65    .154
                                                                   63    .116
    .122(.077)  2.92(1.47)                2.34(.67)       74.94    65    .187
                                                                   63    .144

Four interaction parameter model
    .119(.079)  2.87(1.51)   .16(1.14)    2.44(.99)       74.96    64    .164
                                                                   62    .125

The log-linear model that ends this section contains parameters for independence plus the h, e, and ob terms. This is the last model with three interaction parameters listed in Table 4.6. The fitted cell means m, or Pred values, are given in Table 4.7. This model has a reasonable fit (p = .14 or .19). Each of the interaction terms h, e, and ob appears to make a statistically significant contribution to the model.

A plot of Reschi against jittered rows is given in Figure 4.1. Jittering is a process of adding a small amount of random noise to the discrete-valued rows that improves the visual appearance. Without jittering this plot would consist of a set of 11 vertical 'stripes'.


TABLE 4.7 The fitted means (Pred values) for the log-linear model containing the terms for the model of independence, h, e, and ob. This is the last model with three interaction terms listed in Table 4.6. The Reschi residuals are plotted in Figure 4.1.

                              Anterior
         Left hemisphere                  Right hemisphere

                       .10    .15    .19    .56
                .39    .28    .40    .49   1.44   4.00
        1.55   1.16    .69    .87    .99   2.79   7.68  11.27
 4.28   3.08   2.02   1.10   1.23   1.18   2.98   7.82  11.42  13.89
 4.23   3.00   2.03   1.41   1.41    .87   1.79   4.36   6.24   7.66
 8.37   6.04   4.39   5.46   5.04   1.49   2.57   5.96   8.37  10.30
11.27   8.13   5.10     —    2.74   1.50     —    6.59   9.24  11.44
 8.84   6.38   3.47   1.46   1.24    .96   2.06   4.92   6.96   8.71
        7.37   3.87   1.50   1.28   1.06   2.42   5.94   8.54
        7.46   4.01   1.56   1.35   1.15   2.71   6.80   9.95
               3.55   1.43   1.28   1.12   2.69   6.93
                              Posterior

Figure 4.1  Chi-squared residuals (Reschi) plotted against jittered rows for the stroke data of Table 4.3. The fitted model contains the parameters for the model of independence plus the h, e, and ob effects.

[Scatter plot: Reschi (ranging from about −2 to 2) against jittered row (2 through 10).]


Some of the residuals are very close in value and would appear as a single point in the plot without jittering. Jittering often helps to identify any pattern in the randomness of the residuals when the variable they are plotted against takes on a small number of discrete values. Too much jittering might distort the picture, of course. The amount of jittering should be just enough to produce enough noise to separate some points yet enable the reader to recover the exact row membership of each individual data point.

Jittered rows are produced in the DATA step by introducing a new variable

jr = row + normal(0)/10;

in Program 4.2.

The normal(0) function produces a standard normal (mean zero, unit variance) random variate. The rannor(0) function works in the same manner. These normal values are different every time this statement is executed and every time you run your SAS program. A standard deviation of 1/10 adds just enough random noise to the rows to be useful. Larger values than .1 produce too much noise and might make it difficult for the reader to recover the original, unjittered rows.
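The same trick can be sketched in Python (a hypothetical analogue, not part of the SAS program; random.gauss plays the role of normal(0), and the seed is fixed only to make the sketch reproducible):

```python
import random

random.seed(1)                 # fixed seed for reproducibility (SAS reseeds from the clock)
rows = [2, 2, 3, 3, 3, 4]      # discrete row categories that would overplot
jittered = [r + random.gauss(0, 0.1) for r in rows]   # sd = 1/10, as in jr = row + normal(0)/10
print(jittered)                # points now separate, yet each still rounds back to its row
```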

Only two of the chi-squared residuals in Figure 4.1 are greater than 2 in absolute magnitude. One of these two extreme residuals is located in a remarkable position in the data. The largest positive residual appears in the top row of Table 4.3 containing the only non-zero count. The largest negative residual corresponds to the zero count occurring in column 8, row 6.

Stroke lesions are the result of extreme changes both up and down in blood pressure in the brain. Carotid arteries are the main sources of blood to the brain. These arteries run up from the heart to the brain just inside the ears and are a source of stroke because the blood is under greatest pressure at these points. Another source of lesions is damage to the blood vessels farthest from these blood sources. At these points near the center of the brain, the blood pressure is lowest and subject to perturbations in the distant supply. That is, the locations of greatest potential stroke damage are the center of the brain as well as just inside the ears. The column and e effects in the model of independence capture most of the stroke frequencies close to the ears. The ob offset 'bump' effect models the rise in frequencies of lesions near the center of the brain but seems to be just slightly to the left of the exact center. The result of fitting the e and ob effects is to build a hat-shaped set of expected counts with a raised center and brim. The h effect models an asymmetric pattern. The frequencies of stroke lesions do not appear in symmetric, circular patterns centered in the middle of the brain. The log-linear models summarized in Table 4.6 containing the e and ob terms fit just as well with and without the h interaction modeling asymmetry.

4.4 Bradley-Terry Model for Pairwise Comparisons

A common strategy for selecting the best of several items or ranking a group of items is to have them compared in a pairwise manner. It would be difficult, for example, for you to pick your favorite out of the dozens of available brands of toothpaste or shampoo. Instead, your decision would be much easier if you only had to compare one pair at a time and pick the better of the pair. This is often the way that marketing surveys for new products are performed. The frequency with which this type of data appears in practice has led to the development of a wide variety of ways in which it is analyzed. A popular method is described here and applied to an example of win/loss records of baseball teams.

The data in Table 4.8 gives the win/loss records for seven American League baseball teams in the 1987 season for games played in the Eastern Division. Games played outside the Eastern Division have been omitted. Each team plays each of the others but obviously cannot play itself. So the diagonal entries of this square table are omitted. There are 42 separate counts in Table 4.8. The row sums in this table give the number of games won by each team and the column sums give the number of games lost. In this season, every team played a total of 78 division games. For every team, the row sum plus the column sum must add up to 78. This degree of balance is not required, in general, for the methods of this section to apply.

TABLE 4.8 Win and loss records for 1987 American League, Eastern Division baseball teams. The teams are ordered by their number of games won and lost. Source: Agresti (1990, p. 372).

                                Losing team                              Total
Winning team     Milw.   Det.   Tor.   N.Y.   Bos.   Clev.   Balt.   games won
Milwaukee          —      7      9      7      7      9       11        50
Detroit            6      —      7      5     11      9        9        47
Toronto            4      6      —      7      7      8       12        44
New York           6      8      6      —      6      7       10        43
Boston             6      2      6      7      —      7       12        40
Cleveland          4      4      5      6      6      —        6        31
Baltimore          2      4      1      3      1      7        —        18
Total games lost  28     31     34     35     38     47       60

Used with permission: John Wiley & Sons, Inc.

Let n_ij denote the count in the (i, j) cell. Each n_ij is the number of times that team i beats team j. A closer examination of Table 4.8 shows that every team plays every other team exactly 13 times so every team plays a total of 13 × 6 = 78 games. For every pair of matches between teams i and j we then have

n_ij + n_ji = 13.     (4.2)

There are (7 choose 2) = 21 constraints of this type that the modeling must reflect.

In general, there is no reason that every team (or toothpaste) be compared with every other an equal number of times. In some examples of this type of survey with many items, some pairs of items might never be compared to each other.

When you model this data, the number of comparisons between items i and j must be constrained according to Equation 4.2. The pair of counts n_ij and n_ji then represents a binomially distributed pair with index (sample size) equal to the number of comparisons n_ij + n_ji made between these two items.

Let π_ij represent the probability that team i beats team j and π_ji = 1 − π_ij denote the probability that team j beats team i. The method described next models the binomial probability parameters π_ij.

Bradley and Terry (1952) propose a log-linear model for making paired comparisons in a setting such as this. The Bradley-Terry model specifies that there are parameters φ_i such that

log{π_ij/π_ji} = log{π_ij/(1 − π_ij)} = φ_i − φ_j.     (4.3)

Interpret φ_i as the ability or strength of item i on a log-linear scale. The log-odds that team i beats team j is φ_i − φ_j, or the difference of their strength parameters. The log-odds of the π_ij's is linear in the φ_i parameters so the Bradley-Terry model is also a logistic model. Program 4.3 fits this model in SAS.
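Before turning to the SAS program, the maximum-likelihood estimates can be sketched in pure Python with the classical Zermelo/Ford minorization-maximization iteration (an alternative route to the same MLE that GENMOD finds; the win matrix is taken from Table 4.8):

```python
import math

teams = ["Milwaukee", "Detroit", "Toronto", "New York", "Boston", "Cleveland", "Baltimore"]
wins = [                       # wins[i][j] = games team i won against team j (Table 4.8)
    [0, 7, 9, 7,  7, 9, 11],
    [6, 0, 7, 5, 11, 9,  9],
    [4, 6, 0, 7,  7, 8, 12],
    [6, 8, 6, 0,  6, 7, 10],
    [6, 2, 6, 7,  0, 7, 12],
    [4, 4, 5, 6,  6, 0,  6],
    [2, 4, 1, 3,  1, 7,  0],
]
n, games = len(teams), 13      # every pair meets 13 times: wins[i][j] + wins[j][i] == 13
total_wins = [sum(row) for row in wins]

p = [1.0] * n                  # strengths on the odds scale, p_i = exp(phi_i)
for _ in range(500):           # Zermelo/Ford MM iteration for the maximum-likelihood fit
    p = [total_wins[i] / sum(games / (p[i] + p[j]) for j in range(n) if j != i)
         for i in range(n)]
    p = [pi / p[-1] for pi in p]   # anchor the weakest team (Baltimore) at phi = 0

phi = [math.log(pi) for pi in p]
print([round(f, 4) for f in phi])               # compare with phi1-phi7 in Output 4.2
print(round(games * p[0] / (p[0] + p[1]), 2))   # expected Milwaukee wins over Detroit (Table 4.9)
```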


The following program fits the Bradley-Terry model to the baseball data of Table 4.8. The goodness of fit and fitted parameters are given in Output 4.2.

Program 4.3

%let max=7;      /* maximum number of items to compare */
%let cons=13;    /* number of pairwise comparisons performed */
title1 'Baseball games using the Bradley Terry model';
data bball;
   input wteam $ 1-10 wtmn ltmn freq;
   label
      wteam = 'name of winning team'
      wtmn  = 'winning team number'
      ltmn  = 'losing team number'
      freq  = 'number of wins'
   ;
   jitwin = wtmn + normal(0)/10;     /* jitter winning team number */
   if ltmn < wtmn then delete;       /* omit top half of table */
   comps = &cons;                    /* number of times each team plays others */
   array ph(&max) phi1-phi&max;      /* build phi parameters */
   do j=1 to &max;
      ph(j)=0;                       /* initialize all to zero */
   end;                              /* differences of phi's for comparisons: */
   ph(wtmn) = +1;                    /* +1 for winning team */
   ph(ltmn) = -1;                    /* -1 for losing team */
   datalines;
Milwaukee  1 2 7
Milwaukee  1 3 9
. . . more data . . .
Baltimore  7 6 7
run;

ods output obstats=fitted;
proc genmod;                         /* fit Bradley-Terry model */
   model freq / comps = phi1-phi7
         / dist = binomial noint obstats;
run;

data two;
   merge fitted bball;               /* combine fitted and raw data */
   output;                           /* output lower half of table */
   j=wtmn; wtmn=ltmn; ltmn=j;        /* rebuild upper half of table */
   reschi=-reschi; pred=13-pred;
   jitwin = wtmn + normal(0)/10;
   output;
   drop phi1-phi7;
run;

proc gplot;
   plot Streschi * jitwin;
run;
quit;


Output 4.2

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 15 15.7365 1.0491

Scaled Deviance 15 15.7365 1.0491

Pearson Chi-Square 15 14.6125 0.9742

Scaled Pearson X2 15 14.6125 0.9742

Log Likelihood -172.2482

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 0 0.0000 0.0000 0.0000 0.0000 . .

phi1 1 1.5814 0.3433 0.9086 2.2541 21.22 <.0001

phi2 1 1.4364 0.3396 0.7709 2.1020 17.89 <.0001

phi3 1 1.2945 0.3367 0.6346 1.9543 14.78 0.0001

phi4 1 1.2476 0.3359 0.5893 1.9059 13.80 0.0002

phi5 1 1.1077 0.3339 0.4533 1.7621 11.01 0.0009

phi6 1 0.6839 0.3319 0.0334 1.3343 4.25 0.0393

phi7 0 0.0000 0.0000 0.0000 0.0000 . .

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

The φ_i parameters in the Bradley-Terry model (Equation 4.3) are not identifiable. Specifically, the model is unchanged if a constant is added to all of the φ_i parameters. Similarly, there is no need to include an intercept in this model.

The estimated values of the φ parameters are given in Output 4.2. These estimates are constrained so that the φ of the weakest team is set to zero for identifiability. That is, φ_7 is identically zero. In this manner, all teams can easily be compared to the weakest team. Differences of the estimated φ parameters provide an estimate of the relative strengths of any pair of teams. So, for example, φ_1 − φ_2 estimates the difference in abilities between the two strongest teams.

The phi1-phi7 parameters in Program 4.3 are the φ_i values given above. The seven φ parameters and the 21 constraints modeling the number of pairwise comparisons in Equation 4.2 are fitted by the

model freq / comps = phi1-phi7 / dist = binomial noint obstats;

statement.


TABLE 4.9 Fitted win and loss records for 1987 American League baseball teams using the Bradley-Terry model. Program 4.3 obtains these values.

Losing team

Winning team Milw. Det. Tor. N.Y. Bos. Clev. Balt. Totals

Milwaukee     —    6.97  7.42  7.57  8.01  9.23  10.78    50
Detroit      6.02    —   6.96  7.11  7.56  8.84  10.50    47
Toronto      5.57  6.04    —   6.65  7.11  8.42  10.20    44
New York     5.43  5.89  6.35    —   6.95  8.29  10.09    43
Boston       4.99  5.44  5.89  6.05    —   7.86   9.77    40
Cleveland    3.76  4.16  4.57  4.71  5.14    —    8.64    31
Baltimore    2.22  2.50  2.80  2.90  3.23  4.36     —     18

Totals 28 31 34 35 38 47 60

The variable COMPS is always equal to 13, the number of times each team plays each of the others. The binomial distribution is constrained by this number. The upper-right half of the data in Table 4.8 is redundant because all of those values can be calculated from the lower-left half. The DATA step at the start of Program 4.3 deletes this top half of the data before GENMOD fits the log-linear model. The subsequent DATA step reconstructs this omitted half in order to plot the StReschi residuals of the full data. These residuals are discussed in Section 2.4.

Figure 4.2 Binomial StReschi residuals of the Bradley-Terry model plotted against jittered winning team. Teams are ordered left to right by decreasing percentage of games won.

[Figure: StReschi residuals (vertical axis, −2 to 2) plotted against jittered winning team (Mil, Det, Tor, NY, Bos, Clev, Balt). The labeled points belong to the Boston and Detroit pairing.]

This model fits the data very well (χ2 = 14.61; 15 df; p = .48). The residuals are plotted against the jittered winning team number in Figure 4.2. The teams are ordered left to right in order of decreasing percent of games won. The residuals of this figure are the StReschi values produced by GENMOD. These are the chi-squared residuals, normalized to unit variances by taking into account the estimation of the φ parameters.

Every residual in Figure 4.2 for an observed count nij has a ‘mirror’ with an opposite sign corresponding to the count nji = 13 − nij. There are no extreme outliers revealed by Figure 4.2. Two of the StReschi residuals are greater than 2 in magnitude. These two residuals correspond to the pair of teams Boston and Detroit. Boston played very badly against the slightly better team from Detroit, losing 11 of their 13 matches. Similarly, the corresponding residual shows that Detroit won too many games against Boston. You can see in Table 4.9 that the Detroit team was expected to win only 7.5 of their 13 games against Boston.

The StReschi residuals for the team with the best record, Milwaukee, are small, indicating a consistent pattern of play throughout the baseball season. There is a slight tendency for teams with poorer records to have a greater variability of their residuals, perhaps reflecting a greater variability in their performance over the baseball season.

Page 97: Discrete Distributions - SAS

Chapter 5
Poisson Regression

5.1 Introduction 85
5.2 Poisson Regression for Mortality Data 87
5.3 Poisson Regression with Overdispersion 92

5.1 Introduction

Linear regression methods have traditionally been popular for analysis of data with normally distributed errors. Analogous methods for Poisson distributed data are also useful and are available using the GENMOD procedure. Methods for model building and testing the significance of the individual terms in regression models parallel the methods for normally distributed errors, except that there is no analysis of variance in Poisson regression. The Poisson variance is equal to the mean so these two parameters are not examined separately. When you suspect overdispersion, you perform a separate examination of the Poisson variance. An example with overdispersion in Poisson regression is examined in Section 5.3.

Consider the example of the data given in Table 5.1. This table gives the numbers of men in Japan who died of testicular cancer. The data is categorized by the men's ages (in 5 year intervals) and year intervals of the deaths. Also given is the population in thousands. This population figure provides the number of individuals at risk for cancer in each age and year group. The number of deaths in each age/year combination should be related to the population in that group. A Poisson distribution for the numbers of deaths is appropriate for this data because of the large population at risk and the extremely rare nature of the event. This Poisson approximation of the binomial distribution is described in Section 1.3. The two missing values in Table 5.1 are omitted from the analysis in Section 5.2.
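That approximation is easy to verify numerically. The following Python sketch (illustrative, not part of the book's programs) compares a binomial probability with its Poisson limit, using the population and death count from the first cell of Table 5.1 (15,501 thousand people, 17 deaths):

```python
from math import comb, exp, factorial

# Binomial(n, p) with very large n and tiny p is close to Poisson(n*p)
n = 15_501_000            # population at risk (15501 thousands)
p = 17 / n                # per-person death probability; n*p = 17
lam = n * p

def binom_pmf(k):
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def pois_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

# The two probabilities agree to several decimal places
for k in (10, 17, 25):
    print(k, round(binom_pmf(k), 6), round(pois_pmf(k), 6))
```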

A simple example of Poisson regression is given in Section 2.6.2 in which the number of lottery winners in various towns is modeled as a multiple of the town's population. That section describes the likelihood function that is maximized in order to estimate regression coefficients.

Linear models are common in applied statistics, but, by themselves, they are not natural to Poisson regression. Instead, most models for Poisson data are linear in the log-means rather than linear in the means. In log-mean models the effects of covariates are multiplicative rather than additive. Poisson regression is a generalization of log-linear models for which the covariate information describes membership in a categorical cell.

Let Yi denote a collection of independent Poisson counts with respective mean parameters λi. The method called Poisson regression seeks to model the λi as functions of the additional covariate values xi, which are thought to influence the λi. The requirement that all of the λi be non-negative restricts the use of models that express λi as a linear function of x. Even if the final model expresses the Poisson means as positive quantities, there is no guarantee that GENMOD will not accidentally try some combination of regression parameters in the fitting process that results in a negative λi at some intermediate stage.


86 Advanced Log-Linear Models Using SAS

TABLE 5.1 Deaths in Japan due to cancer of the testis by age, year, and population in thousands. The values indicated by ‘*’ are missing and have been omitted from the analysis. Source: Lee et al. (1973), Andrews and Herzberg (1985, pp. 233–5).

1947–49 1951–55 1956–60 1961–65 1966–70

Age pop. deaths pop. deaths pop. deaths pop. deaths pop. deaths

 0  15501  17  26914  51  21027  65  20246  69  21596   74
 5  14236   *  25380   6  26613   7  20885   8  20051    7
10  13270   *  23492   3  25324   3  26540   7  20718   11
15  12658   2  21881   6  23211  15  24931  25  26182   39
20  10696   5  20402  27  21263  39  22228  56  24033   83
25   7563   5  17242  40  19994  58  20606  97  21805  125
30   7074   7  12609  18  17128  54  19864  77  20750  129
35   7038  10  11712  13  12476  36  17001  70  19890  101
40   6418   9  11478  26  11450  32  12275  29  16794   67
45   5981   7  10274  16  11157  26  11147  34  11962   37
50   4944   7   9325  16   9828  27  10705  27  10741   29
55   3994   7   7562  17   8718  19   9206  32  10086   39
60   3098   6   5902  13   6796  21   7869  21   8399   31
65   2317   4   4244  12   4911  26   5728  29   6715   34
70   1513   7   2845  17   3197  22   3737  25   4448   33
75    688   5   1587   9   1812  10   2061  25   2482   31
80    264   2    583   6    787   6    904  14   1068    9
85     73   2    179   2    246   3    335   3    419    3

Public domain: Journal of the National Cancer Institute, Oxford University Press.

To avoid such problems, it is customary to model the logarithm of the mean parameter as a linear function of covariates. That is,

log(λi) = β′xi

so that

λi = exp(β′xi)

can never be negative, regardless of the values of β or xi.

The logarithm is not the only way you can model λi and keep it positive, but this method is the most common. Models that are log-linear or linear in the logs of the Poisson means are discussed in Sections 2.2 and 2.3. The mathematical form of the Poisson likelihood function (described in Section 2.6) lends itself to models of λi that are linear in their logs.
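To make this concrete, here is the log-likelihood kernel coded directly (an illustrative Python sketch with made-up data, not one of the book's programs). With log(λi) = x′iβ, the term yi·x′iβ is linear in β, which is what makes log-linear models computationally convenient:

```python
import math

def poisson_loglik(beta, X, y):
    """Poisson log-likelihood with a log link: log(lambda_i) = x_i' beta.
    Returns sum_i [ y_i * eta_i - exp(eta_i) - log(y_i!) ]."""
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))  # linear predictor
        ll += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return ll

# Toy intercept-only example: counts 2, 3, 4 with a common mean lambda.
X = [[1.0], [1.0], [1.0]]
y = [2, 3, 4]

# The log-likelihood is maximized at lambda = sample mean = 3:
for lam in (2.0, 3.0, 4.0):
    print(lam, round(poisson_loglik([math.log(lam)], X, y), 4))
```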

The logarithmic relationship between the Poisson mean and the linear function of covariates x is called the link function in the generalized linear models that are the basis of the GENMOD procedure. This log link is the default option in GENMOD when you are specifying Poisson distributed data.

You program Poisson regression in GENMOD through the MODEL statement. For example, the following SAS code illustrates the usual form in which Poisson regression is performed:

proc genmod;
   model y = x1 x2 x3 / dist=Poisson;
run;

The logarithmic link function (link=log) is assumed and need not be specified when the Poisson distribution is used. In other words, the specification of the LINK function in the following code is redundant.


proc genmod;
   model y = x1 x2 x3 / dist=Poisson link=log;
run;

Use the identity link to fit a model that is linear in the mean (as opposed to linear in the log of the mean) as follows:

proc genmod;
   model y = x1 x2 x3 / dist=Poisson link=identity;
run;

Use this option with caution because fitted values might result in negative estimates for the Poisson mean. The identity value specifies that the Poisson mean is to be modeled directly as a linear function of covariates.
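The danger is easy to see with made-up numbers (a Python sketch, not from the book): a linear predictor with a negative slope eventually produces a negative "mean", while the exponential of the same predictor stays positive:

```python
import math

# Hypothetical coefficients for a single covariate x
beta0, beta1 = 2.0, -1.5

for x in (0.0, 1.0, 2.0):
    identity_mean = beta0 + beta1 * x            # identity link: can go negative
    log_link_mean = math.exp(beta0 + beta1 * x)  # log link: always positive
    print(x, identity_mean, round(log_link_mean, 4))

# At x = 2 the identity-link value is -1.0, which is not a valid
# Poisson mean; the log-link value exp(-1) is still positive.
```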

Since this point is already made in Chapter 2, there is little new in this chapter as far as the SAS programming is concerned, except for the complexity of the examples discussed. Two examples are examined in this chapter: the cancer deaths of Table 5.1 mentioned above, and an example that measures species diversity on the Galapagos Islands as a function of a variety of covariate values.

5.2 Poisson Regression for Mortality Data

Let λij denote the mean of the count for the ith age and jth year category in the mortality data of Table 5.1. The log-linear model for independence of age and year is

log λij = µ + αi + βj.

In addition to the age and year effects, there is also a cohort effect that can be modeled. Individuals at risk of cancer in one age and year category are also included in the next age and year category five years later. Except for the first time interval of Table 5.1, the years and ages are all for half-decade intervals. If we treat the first column of the table as a five year interval, then there is a cohort effect in which individuals in the same diagonal are born in the same five year interval.

The effects of the various cohorts in Table 5.1 can be modeled using a term in a log-linear model. Constant groups of cohorts appear in the diagonal lines of categories from the upper left to the lower right in the table. All indices in λij with the same value of i − j represent the same cohort. You can then create a cohort effect for each value of i − j that will be used as a CLASS variable in a Poisson regression model.

This produces a log-linear model of the form

log λij = µ + αi + βj + θi−j  (5.1)

where values of . . . , θ−1, θ0, θ1, . . . are parameters that represent the effects of the various cohorts in the data set.
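Program 5.1 below implements this diagonal indexing with the assignment cohort = agegrp - year + 4, where the constant 4 simply keeps the labels non-negative. A one-line Python analogue (illustrative only) confirms that cells on the same upper-left-to-lower-right diagonal share a label:

```python
def cohort_index(agegrp, year, offset=4):
    """Diagonal (birth-cohort) label: cells with equal agegrp - year
    belong to the same cohort; the offset keeps labels non-negative."""
    return agegrp - year + offset

# Moving one age group down AND one year period across stays in the
# same birth cohort, so the label is unchanged:
print(cohort_index(3, 0), cohort_index(4, 1), cohort_index(5, 2))  # 7 7 7
```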

The parameters of Equation 5.1 are estimated in a log-linear model in Program 5.1. Output 5.1 contains partial output for Equation 5.1. Although the fit of the model is good (χ2 = 29.81; 46 df; p = .97), this model does not use the population data. Some of the significance levels in Output 5.1 are distorted because of this.

The following program fits Poisson regression models for the cancer data of Table 5.1. It fits Equations 5.1 and 5.2 and produces the plot of Figure 5.1. Outputs 5.1 and 5.2 contain goodness of fit and significance levels for these models.


Program 5.1

title1 'Deaths from testicular cancer in Japan';
data tcancer;
   input age pop1 c1 pop2 c2 pop3 c3 pop4 c4 pop5 c5;
   array p(5) pop1-pop5;   /* populations at each year group */
   array d(5) c1-c5;       /* cancer cases for each year group */
   label
      agegrp = 'age in 5yr interval'
      logpop = 'log-population';
   /* Produce a separate line for each year/age combination */
   agegrp=age/5;                 /* age in five year intervals */
   do j=1 to 5;                  /* for each line read in . . */
      deaths=d(j);               /* number of cancer deaths */
      logpop=log(p(j));          /* log of population */
      year=j-1;                  /* recode the years as 0,...,4 */
      yearc=year;                /* year category */
      cohort=agegrp - year + 4;  /* identify the diagonal cohort */
      output;     /* produce five output lines for each read in */
   end;
   drop j pop1-pop5 c1-c5;       /* omit unneeded variables */
datalines;
0 15501 17 26914 51 21027 65 20246 69 21596 74

   . . . more data. . .

85 73 2 179 2 246 3 335 3 419 3
run;
title2 'Equation 5.1: Cohort, age, year; No population information';
proc genmod;
   class yearc agegrp cohort;
   model deaths = yearc agegrp cohort / type1 type3 dist=Poisson;
run;
title2 'Equation 5.2: Cohort, age, year, and offset log(Pop)';
proc genmod;
   class yearc agegrp cohort;
   ods listing exclude obstats;      /* turn off the obstats listing */
   output out=fitted reschi=reschi;  /* create an output data set */
   model deaths = yearc agegrp cohort
         / obstats type1 type3 offset=logpop dist=Poisson;
run;
proc gplot data=fitted;   /* bubble plot of residuals */
   bubble reschi * agegrp = year;
run;


Output 5.1 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 46 30.2652 0.6579

Scaled Deviance 46 30.2652 0.6579

Pearson Chi-Square 46 29.8103 0.6480

Scaled Pearson X2 46 29.8103 0.6480

Log Likelihood 6523.5040

The GENMOD Procedure

LR Statistics For Type 1 Analysis

Source Deviance DF Chi-Square Pr > ChiSq

Intercept 2015.6260

year 1257.0724 4 758.55 <.0001

agegrp 114.5923 17 1142.48 <.0001

cohort 30.2652 20 84.33 <.0001

The GENMOD Procedure

LR Statistics For Type 3 Analysis

Source DF Chi-Square Pr > ChiSq

year 3 48.09 <.0001

agegrp 16 613.61 <.0001

cohort 20 84.33 <.0001

A good, intuitive model should take into account the additional information provided by the changing population structure in post-war Japan. Equation 5.1 assumes that the distribution of the various ages remains unchanged over the time periods covered by the data.

Most of the log-linear models considered up to this point in previous chapters describe mean counts in terms of their row or column membership in a table. In this data set the population variable is a continuous valued covariate. The following method demonstrates how to include this population data into the model. The model uses log population as an OFFSET variable. The effects of age, cohort, and year are still expressed as CLASS variables.

In this model,

log(λij) = log(Popij) + µ + αi + βj + θi−j  (5.2)

Popij denotes the population of the (i, j) category in Table 5.1.


The parameters αi, βj and θi−j are not constrained except to provide identifiability. The fit is very good (χ2 = 30.0; 46 df; p = .97). It is as good as that of Equation 5.1 but Equation 5.2 uses more information and is more easily interpreted. Program 5.1 fits this model and produces Output 5.2.

Output 5.2 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 46 30.4253 0.6614

Scaled Deviance 46 30.4253 0.6614

Pearson Chi-Square 46 30.0204 0.6526

Scaled Pearson X2 46 30.0204 0.6526

Log Likelihood 6523.4239

The GENMOD Procedure

LR Statistics For Type 1 Analysis

Source Deviance DF Chi-Square Pr > ChiSq

Intercept 1543.7788

year 1222.5236 4 321.26 <.0001

agegrp 89.4228 17 1133.10 <.0001

cohort 30.4253 20 59.00 <.0001

The GENMOD Procedure

LR Statistics For Type 3 Analysis

Source DF Chi-Square Pr > ChiSq

year 3 9.82 0.0202

agegrp 16 554.97 <.0001

cohort 20 59.00 <.0001

Notice that there is no parameter to be estimated in association with the offset population variable in Equation 5.2. The result of the offset is that the fitted cancer rate is proportional to the population size.

The regression coefficient of the log-population is set to the value 1 in Equation 5.2, which corresponds to the status this variable has as an OFFSET variable in Program 5.1. Using log population as an offset variable is the same as modeling the mean number of deaths as proportional to population.

Specifically, the expected number of deaths λij is proportional to the population in the ith age and jth year groups times marginal effects for these years, ages, and cohorts.


That is,

cancer rate = λij / Popij

or more generally,

λij = Popij × exp(µ + αi + βj + θi−j)

The scale in which the population is measured does not matter. If the population is measured in thousands or millions, for example, the scale will be reflected in different values of the intercept, µ. The parameters αi, βj, and θi−j model the number of deaths for the respective ages, years, and cohorts.
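A short numeric check (Python, with made-up values for µ and the effects; not from the book) shows why the measurement scale is absorbed by the intercept when log population enters as an offset:

```python
import math

def fitted_deaths(pop, mu, effects):
    """Fitted mean with log(pop) as an offset:
    lambda = exp(log(pop) + mu + sum of effects) = pop * exp(...)."""
    return pop * math.exp(mu + sum(effects))

mu, effects = -7.0, [0.4, 0.2]   # hypothetical intercept and class effects

# Population in thousands versus raw counts: rescaling by 1000 is
# exactly offset by shifting the intercept by log(1000).
a = fitted_deaths(15501.0, mu, effects)
b = fitted_deaths(15_501_000.0, mu - math.log(1000.0), effects)
print(round(a, 8), round(b, 8))  # equal fitted means
```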

There are several benefits to using log population as an OFFSET variable. The first of these is the gain in interpretation, with only a negligible loss in model fit. Even though it results in one less df, intuitively, one expects the number of cancer deaths to be proportional to the population size. This benefit in interpretation more than offsets the loss of 1 df. This data set offers ample df and opportunities for modeling, but there are important statistical advantages in using the simplest possible model when making inferences. These advantages are illustrated in the discussion and models of Section 6.2.

A second benefit to using a more intuitive model is the attenuation of the chi-squared values for Type 1 and Type 3 analyses when you compare the corresponding values in Outputs 5.1 and 5.2. In every case, the significance levels for age, year, and cohort are all large, but they are less striking in Output 5.2. This is partly due to the slight decrease in overall goodness of fit but also because the marginal effects of year and age changes are accounted for by changes in the population structure.

Program 5.1 also produces the residual bubble plot given in Figure 5.1. It plots the Reschi chi-squared residuals for Equation 5.2. More recent years are represented by larger bubbles. There are no residuals larger than 2 in absolute magnitude. The youngest ages and the 60 year-old men appear to have the smallest deviations from the model.

Figure 5.1 Pearson residuals for Equation 5.2 of the cancer deaths.

[Figure: Reschi residuals (vertical axis, −2 to 2) plotted against age in years (0 to 80), one plotting symbol per year group, with more recent years drawn as larger bubbles.]

The Type 1 and Type 3 analyses in Outputs 5.1 and 5.2 show that there are huge age and cohort effects but a more modest effect due to the year. The null hypotheses for these analyses are that there are no differences in years, ages, or cohorts. The differences in df for these two different analyses can be traced to the confounding of various effects. A Type 3 statistical analysis considers the statistical significance of a term as though that term is added last to the log-linear model, after all other effects have been included. Certain confounded effects cannot be estimated once other effects are already in the model. This results in fewer df for the Type 3 statistical tests given in Output 5.2. Section 2.5.2 describes the Type 3 analysis in more detail.

5.3 Poisson Regression with Overdispersion

The Galapagos Islands are an archipelago of 30 named Pacific islands straddling the equator about 500 miles from the coast of South America. Their geographical isolation has made them the subject of many biological studies of their flora and fauna from the time of Darwin's visit in the 1830s up to today. Table 5.2 gives the number of species on each of the islands along with a number of covariates that might prove useful in modeling the number of different species that each island can support. (The word galapagos is Spanish for tortoises, and refers to the giant creatures found on these islands.)

Not all species are the same in terms of their needs for survival. Some plants and shellfish cling to bare rocks by the sea while more advanced life forms such as birds might need a variety of insect life and plant seeds for survival. The number of species living on each island can still be examined using a Poisson distribution. The lack of equivalence of the various species results in an overdispersion, or increase, in the overall variability of the counts in Table 5.2.

A number of useful covariates are included in Table 5.2 that can be shown to influence the number of different species on each island. The land-mass area of the island is a useful indication of the number of species that the island can support. Larger islands can provide a wider variety of opportunities for a greater diversity of life.

The distances to the nearest neighboring island and to Santa Cruz are indications of the relative isolation of each island. Santa Cruz is the second largest of the archipelago and is located at approximately the geographic center of the group. Santa Cruz also offers the largest number of different species. The islands Darwin and Wolf are located at some considerable distance to the north of the others. The last column of Table 5.2 gives the number of species on the nearest neighboring island. A large number of species on a nearby island might serve to increase the diversity of an otherwise less habitable island.

Program 5.2 provides an example of how models can be fit to this data. A variety of Poisson regression models are possible. One model with all the covariates of Table 5.2 and their pairwise interactions is summarized in the GENMOD Output 5.3. New covariates can also be constructed that offer biological interpretation and plausibility of their influence. With some small effort you might uncover several additional functions of covariates in addition to those described here that significantly influence the mean number of species found on each island.


TABLE 5.2 Diversity of species on each of the Galapagos Islands. Source: Johnson and Raven (1973); Andrews and Herzberg (1985, pp. 291–3).

                                 Distance in km      Species on
                Species    Area    to nearest  to Santa  adjacent
Island         observed  in km²      neighbor      Cruz    island

Baltra               58    25.09        .6        .6        44
Bartolome            31     1.24        .6      26.3       237
Caldwell              3      .21       2.8      58.7         5
Champion             25      .10       1.9      47.4         2
Coamano               2      .05       1.9       1.9       444
Daphne Major         18      .34       8.0       8.0        44
Daphne Minor         24      .08       6.0      12.0        18
Darwin               10     2.33      34.1     290.2        21
Eden                  8      .03        .4        .4       108
Enderby               2      .18       2.6      50.2        25
Espanola             97    58.27       1.1      88.3        58
Fernandina           93   634.49       4.3      95.3       347
Gardner A            58      .57       1.1      93.1        97
Gardner B             5      .78       4.6      62.2         3
Genovesa             40    17.35      47.4      92.2        51
Isabela             347  4669.32        .7      28.1        91
Marchena             51   129.49      29.1      85.9       104
Onslow                2      .01       3.3      45.9        25
Pinta               104    59.56      29.1     119.6        51
Pinzon              108    17.95      10.7      10.7         8
Las Plazas           12      .23        .5        .6        58
Rabida               70     4.89       4.4      24.4       237
San Cristobal       280   551.62      45.2      66.6        58
San Salvador        237   572.33        .2      19.8        70
Santa Cruz          444   903.82        .6        .0         8
Santa Fe             62    24.08      16.5      16.5       444
Santa María         285   170.92       2.6      49.2        25
Seymour              44     1.84        .6       9.6        58
Tortuga              16     1.24       6.8      50.9       108
Wolf                 21     2.85      34.1     254.7        10

Used with permission: Springer-Verlag.

The following program fits Poisson and negative binomial regression models to the species data of Table 5.2.

Program 5.2

title1 'Species diversity on the Galapagos Islands';
data species;
   input name $ 1-13 species area d2neigh d2sc adjsp;
   loga=log(area);
   area=area/1000;          /* rescale variables */
   adjsp=adjsp/100;
   d2sc=d2sc/100;
   d2neigh=d2neigh/100;
   isa=0; if name='Isabela' then isa=1;   /* indicator for Isabela */
   label
      species = '# of species on island'
      area    = 'in sq-km'
      loga    = 'log area'
      d2neigh = 'distance to nearest neighbor'
      d2sc    = 'distance to Santa Cruz'
      adjsp   = '# of species on adjacent island';
datalines;
Baltra        58 25.09 0.6 0.6 44

   . . . more data. . .

Wolf          21 2.85 34.1 254.7 10
run;
proc print;
run;
title2 'Fit Poisson model with all pairwise interactions';
proc genmod;
   ods output obstats=fit;
   model species = loga | d2neigh | adjsp | d2sc @2 isa
         / dist=Poisson obstats type1 type3;
run;
proc plot data=fit;   /* plot Poisson Pearson residuals */
   plot reschi * loga;
run;
title2 'Fit Negative Binomial model with all pairwise interactions';
proc genmod;
   ods output obstats=fitnb;
   model species = loga | d2neigh | adjsp | d2sc @2 isa
         / dist=nb obstats type1 type3;
run;
proc plot data=fitnb;   /* plot negative binomial residuals */
   plot reschi * loga;
run;

Output 5.3 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 18 277.7882 15.4327

Scaled Deviance 18 277.7882 15.4327

Pearson Chi-Square 18 258.1626 14.3424

Scaled Pearson X2 18 258.1626 14.3424

Log Likelihood 10426.3391


Output 5.3 (continued)

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 1 3.0806 0.1154 2.8545 3.3068 712.81 <.0001

loga 1 0.4351 0.0175 0.4008 0.4695 615.14 <.0001

d2neigh 1 1.2972 1.0999 -0.8585 3.4528 1.39 0.2382

loga*d2neigh 1 0.1993 0.1137 -0.0236 0.4221 3.07 0.0797

adjsp 1 0.0879 0.0596 -0.0289 0.2046 2.17 0.1403

loga*adjsp 1 -0.0352 0.0127 -0.0601 -0.0103 7.65 0.0057

d2neigh*adjsp 1 -0.8172 0.4616 -1.7219 0.0875 3.13 0.0767

d2sc 1 0.7207 0.1828 0.3625 1.0789 15.55 <.0001

loga*d2sc 1 -0.1302 0.0341 -0.1971 -0.0633 14.56 0.0001

d2neigh*d2sc 1 -3.1456 0.6614 -4.4418 -1.8493 22.62 <.0001

adjsp*d2sc 1 -0.1958 0.1047 -0.4009 0.0093 3.50 0.0614

isa 1 -0.5645 0.0962 -0.7530 -0.3759 34.43 <.0001

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

The GENMOD Procedure

LR Statistics For Type 3 Analysis

Source DF Chi-Square Pr > ChiSq

loga 1 831.85 <.0001

d2neigh 1 1.37 0.2411

loga*d2neigh 1 3.16 0.0756

adjsp 1 2.14 0.1432

loga*adjsp 1 7.41 0.0065

d2neigh*adjsp 1 3.15 0.0757

d2sc 1 15.65 <.0001

loga*d2sc 1 14.35 0.0002

d2neigh*d2sc 1 23.13 <.0001

adjsp*d2sc 1 3.51 0.0612

isa 1 34.40 <.0001


There is no question that larger islands support a wider variety of life forms. A strong correlation exists between the mean number of species and the island's area. Poisson regression models the log of mean number of species rather than mean number of species directly. So, the linear relationship is expressed by including log(area) in the GENMOD MODEL statement. (Recall the use of the log of a covariate, rather than the value of the covariate itself, in the population offset variable in Section 5.2.) The log area is called LOGA in Program 5.2 and Output 5.3. This variable has a huge statistical effect in explaining the number of species (χ2 = 615; 1 df; p < 10⁻⁶).

Isabela is by far the largest (but not the most biologically diverse) island with an area over five times that of the second largest, Santa Cruz. The influence that Isabela's size exerts in most regression models prompts the use of a 1 df indicator variable (ISA) to model this one island's contribution to the data. Output 5.3 shows that the statistical effect of this indicator variable is large (χ2 = 34.4; 1 df; p < 10⁻⁴).

The log of the area is such an important covariate in the model of diversity that it appears in several interactions with other covariates that are not statistically significant by themselves. The number of species on an adjacent island (ADJSP), for example, is not a useful covariate by itself (χ2 = 2.14; 1 df; p = .14). A neighboring island with few species is unlikely to provide any new life forms to its neighbor. Instead, an important influence on diversity is the interaction between the log-area and the number of species on the adjacent island. This interaction is given in Output 5.3 with χ2 = 7.41; 1 df; p = .0065. In other words, a large area combined with a diverse neighbor provides a useful environment for many species but a diverse neighbor by itself is not sufficient.

Several remarkable residuals are indicated in the chi-squared residual plot of Figure 5.2. Isabela has already been identified as the largest island, by far, and its residual is identically zero because an indicator variable is fitted for this single observation. The Darwin and Wolf islands are indicated in Figure 5.2 but do not exhibit extreme residuals. These two islands have already been mentioned as being located at a great distance from the others.

The islands Santa María and Marchena are also identified in this plot. These two rather large islands are also located at some distance from the others but not as far away from the central Santa Cruz island as Darwin and Wolf. Santa María and Marchena present extremely large positive and negative residuals, respectively.

Overall, there appears to be a large degree of overdispersion in the data. This is reflected in the Value/DF of more than 14, as given in Output 5.3, for the deviance and chi-squared statistics.

The Reschi residuals in Figure 5.2 range between ±6 rather than ±2 as might be expected in a well-fitting model for Poisson distributed data. This is further evidence of overdispersion. As a rough estimate, you can see that the standard errors are understated by a factor of about 3. While there are some remarkable outliers, the lack of fit of the overall model cannot be traced to a small number of unusual values. Instead, there appears to be an overall increase in the variability in Figure 5.2 and this increase is spread over the whole data set.

There are three formal ways that the amount of overdispersion can be estimated by GENMOD in this example. Two of these methods are based on the values of the deviance and the Pearson chi-squared goodness of fit statistics. The third method fits the negative binomial distribution.

The overdispersion measured in GENMOD using the scale=Pearson or scale=p option gives the estimate

√(Pearson Chi-Square / df) = √(258.16/18) = 3.787

using the Pearson values in Output 5.3. If you use the scale=p option, then all of the standard errors are multiplied by this amount and the significance levels are adjusted accordingly. This estimate of 3.78 is consistent with the rough guess of 3 that can be obtained by looking at Figure 5.2.


Figure 5.2 Pearson residuals for the Poisson model of the species data in Table 5.2. Several remarkable islands are identified.

[Figure: Reschi residuals (vertical axis, −6 to 6) plotted against log area of island (−5 to 10). The islands Marchena, Isabela, Darwin, Santa María, and Wolf are labeled.]

Overdispersion measured with scale=deviance or scale=d in the present example yields the estimate

√(Deviance / df) = √(277.788/18) = 3.928
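Both scale estimates come straight from the fit statistics in Output 5.3, as this quick Python check (not part of the book's programs) confirms:

```python
import math

# Goodness-of-fit statistics and df from Output 5.3
pearson, deviance, df = 258.1626, 277.7882, 18

scale_p = math.sqrt(pearson / df)    # the scale=pearson (scale=p) estimate
scale_d = math.sqrt(deviance / df)   # the scale=deviance (scale=d) estimate
print(round(scale_p, 3), round(scale_d, 3))  # 3.787 3.928
```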

The SCALE= option is also described in Section 1.5.

The third method, which fits the negative binomial model, is illustrated in Program 5.2 with the corresponding Output 5.4. The negative binomial model is fit using the dist=nb specification for the distribution in the MODEL statement.


Output 5.4 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 18 33.3593 1.8533

Scaled Deviance 18 33.3593 1.8533

Pearson Chi-Square 18 26.2928 1.4607

Scaled Pearson X2 18 26.2928 1.4607

Log Likelihood 10510.9117

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq

Intercept 1 3.3274 0.3438 2.6536 4.0012 93.69 <.0001

loga 1 0.3966 0.0600 0.2791 0.5142 43.73 <.0001

d2neigh 1 0.5056 4.2564 -7.8368 8.8481 0.01 0.9054

loga*d2neigh 1 0.2028 0.5020 -0.7811 1.1867 0.16 0.6863

adjsp 1 -0.0458 0.2646 -0.5643 0.4728 0.03 0.8627

loga*adjsp 1 0.0119 0.0633 -0.1121 0.1358 0.04 0.8512

d2neigh*adjsp 1 -0.8018 2.3808 -5.4681 3.8644 0.11 0.7363

d2sc 1 0.3353 0.6736 -0.9850 1.6557 0.25 0.6186

loga*d2sc 1 -0.1110 0.1205 -0.3473 0.1252 0.85 0.3569

d2neigh*d2sc 1 -1.9180 2.5254 -6.8678 3.0318 0.58 0.4476

adjsp*d2sc 1 -0.1932 0.6167 -1.4019 1.0156 0.10 0.7541

isa 1 -0.6663 0.7026 -2.0435 0.7108 0.90 0.3430

Dispersion 1 0.2820 0.0934 0.1473 0.5397

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.


Output 5.4 (continued)

The GENMOD Procedure

LR Statistics For Type 3 Analysis

Source DF Chi-Square Pr > ChiSq

loga 1 27.81 <.0001

d2neigh 1 0.01 0.9053

loga*d2neigh 1 0.16 0.6864

adjsp 1 0.03 0.8606

loga*adjsp 1 0.04 0.8506

d2neigh*adjsp 1 0.11 0.7367

d2sc 1 0.25 0.6185

loga*d2sc 1 0.82 0.3642

d2neigh*d2sc 1 0.57 0.4520

adjsp*d2sc 1 0.10 0.7545

isa 1 0.82 0.3640

The negative binomial extra-variation parameter value is called Dispersion in Output 5.4, and its estimated value is .2820. This model has the same covariates as the model given in Output 5.3. Specifically, it contains all of the variables included in Table 5.2, all of their pairwise interactions, and a single df indicator variable for Isabela. New estimated values for the regression coefficients are also obtained in Output 5.4, and these are generally close in value to those seen in Output 5.3.

The reciprocal of the negative binomial dispersion estimate is

1/0.2820 = 3.5461.

You can compare this number to the values obtained using scale=p and scale=d, given above. All of the standard errors in Output 5.4 have been multiplied by this amount, and the corresponding confidence intervals and significance levels are adjusted accordingly. The chi-squared residuals are plotted against the islands’ log areas in Figure 5.3.


Figure 5.3 Pearson residuals (Reschi) plotted against the log area of each island for the species data, fitting the negative binomial model. The smallest islands appear to have the greatest variability; Isabela is identified.

Thus, the fit of an overdispersion parameter in Output 5.4 increases all estimated standard errors and attenuates the significance levels. The only statistically significant effect in this output is the log area. Compare this conclusion to Output 5.3, in which there are several highly statistically significant effects in the model of species diversity but the overall model lacks a good fit.

Measures of overdispersion are just one useful diagnostic, and the examination of specific residuals is another. Figure 5.3 does not exhibit any extreme outliers, and the negative binomial parameter seems to have captured all of the excess variation. All of the chi-squared residuals are within the range that you would expect for a well-fitting model.

The pattern of differences in diversity among the smallest islands is easily seen in Figure 5.3. The overdispersion is therefore largely attributed to the variability of the smaller islands. The larger islands exert a large degree of leverage, pulling the fitted model toward their observations and resulting in small residuals. Diversity among the small islands is harder to explain using only the covariates given in this data set. These are the islands in the archipelago that require the greatest additional study to discover the true determinants of species diversity.


Chapter 6
Finite Population Size Estimation

6.1 Introduction
6.2 A Small Example
6.3 A Larger Number of Lists

6.1 Introduction

The goal of the methods described in this chapter and the next is to estimate the size of a closed population. The examples take the form of multidimensional cross-classified tables of frequencies. The problem is that some discrete categories of the data are not observable. The method of this chapter is to fit a log-linear model to the observable data and then extrapolate this model to estimate the mean of the unobservable data. A different approach to this problem of population size estimation, using a truncated Poisson distribution, is described in Chapter 7.

The earliest uses of these techniques were motivated by problems in fisheries and wildlife management. More recent applications have been made by epidemiologists and census statisticians employed by government agencies. Surveys of this type are sometimes referred to as mark-recapture or capture-recapture samples. In a fishery or wildlife setting, a number of individuals are caught and marked or identified in some manner. These individuals are then released and allowed to re-enter the population base. A subsequent wave of captures might contain a number of those identified earlier. A comparison of the number of individuals captured two or more times can be used to estimate the total population size.

6.2 A Small Example

The data in Table 6.1, given in Hook, Albright, and Cross (1980), lists the number of recorded infants born in New York State with spina bifida. Spina bifida is a serious birth defect in which the spinal cord is not entirely enclosed at the time of birth. This usually results in the death of the infant shortly after birth. The problem here is to estimate the number of babies born with this affliction in the population from the sample data given in Table 6.1.

Hook et al. examined three administrative lists to identify these infants: birth certificates, death certificates, and medical records. Each of these three lists provides an incomplete source of names of all infants born with spina bifida. The data of Table 6.1 summarize the frequencies of infants’ names appearing on all combinations of lists.

Although the birth defect is immediately obvious upon delivery of the baby, it might not be mentioned on the birth certificate. Similarly, while spina bifida is almost certainly the cause of death, it is not always listed on the death certificate. If the child dies quickly after birth then there might not have been sufficient time to generate a medical record describing


TABLE 6.1 Infants born with spina bifida in New York State, 1969–74. The number in the ??? category is not observed, and estimates of its value are given in Table 6.2 using Program 6.1. Source: Hook et al. (1980).

                               Name on medical record?
                            No                     Yes
Name on birth      Death certificate?      Death certificate?
certificate?         No        Yes           No        Yes
No                  ???         49           60          4
Yes                 247        142          112         12

Used with permission: American Journal of Epidemiology, Oxford University Press.

the treatment, if any. In other words, each of these three lists is incomplete, and there are different reasons for these omissions. In the language of Chapter 4, the ??? count is a structural zero because it cannot be observed.

These three lists are also correlated with one another. As an example, if the birth certificate lists spina bifida, then the delivery staff might have overlooked the need to mention it again on the death certificate. These two lists might then have a negative log odds ratio, in the sense that names on the list of birth certificates would be less likely to be included on the list of death certificates.

The statistical problem is to estimate the number of infants whose names do not appear on any of these three lists. This is the count in the cell indicated by the ??? in Table 6.1. In the process of estimating the count in this cell you can model the various interactions between the three lists using log-linear models in GENMOD.

Program 6.1 uses the GENMOD procedure to generate every possible log-linear model for this data. Indicator variables for the three lists are coded so that the intercept of each model (when birth = death = medic = 0) is the estimate for the log of the count in the unobserved category.

Program 6.1
data;
   input count birth death medic;
   label
      birth = 'names on birth certificate'
      death = 'names on death certificate'
      medic = 'names on medical record';
datalines;
 60 0 0 1
 49 0 1 0
  4 0 1 1
247 1 0 0
112 1 0 1
142 1 1 0
 12 1 1 1
run;
title2 'Log-linear model of mutual independence';
proc genmod;                /* m, d, b model */
   model count = medic birth death / dist = Poisson;
run;
title2 'Log-linear models with one interaction';
proc genmod;                /* d, b*m model */
   model count = medic birth death birth*medic / dist = Poisson;
run;
proc genmod;                /* m, b*d model */
   model count = medic birth death birth*death / dist = Poisson;
run;
proc genmod;                /* b, d*m: Equation 6.1 with output in Output 6.1 */
   model count = medic death birth death*medic / dist = Poisson;
run;
title2 'Log-linear models with two interactions';
proc genmod;                /* b*m, d*b model */
   model count = medic birth death birth*medic birth*death / dist = Poisson;
run;
proc genmod;                /* b*d, m*d model */
   model count = medic birth death birth*death death*medic / dist = Poisson;
run;
proc genmod;                /* d*m, m*b model */
   model count = medic birth death death*medic birth*medic / dist = Poisson;
run;

There are only seven observations in Table 6.1, so a hierarchical log-linear model with a three-way interaction has more parameters than observations. The simpler log-linear model with all three pairwise interactions is also saturated. It fits the data exactly; there is one parameter for each of the seven observations and χ² = 0. It also has an intercept, but this (and all other) parameters are measured without meaningful estimates of their standard errors. Such models are generally not useful and are excluded from Program 6.1.

Among all of the well-fitting models with fewer parameters than observations, some are better choices than others. Below, you will see that there are often benefits to using models with fewer parameters that, perhaps, fit less well than models with a greater number of interaction terms. That is, the aim is not always to identify the best-fitting model possible. There is a benefit to parsimony in model choice: simpler models are generally associated with smaller estimated standard errors and narrower confidence intervals.

For each of the fitted models in Program 6.1, the GENMOD procedure calculates a goodness-of-fit chi-squared statistic and its df. Extremely small p-values indicate that the model does not fit the data well.

Table 6.2 gives a summary of all seven log-linear models fit to the spina bifida data. The first row summarizes a model of mutual independence among the three lists. The shorthand for this model is [b, m, d], corresponding to

log m_{bmd} = µ + α_b + β_m + γ_d.

TABLE 6.2 Log-linear models used to estimate the number of infants born with spina bifida whose names do not appear on any of the three administrative lists. These values are derived from the output of Program 6.1. Equation 6.1 is indicated by (†).

Terms in the                                Estimated missing count
log-linear model     χ²    df   p-value    Point     95% C.I.            C.I. width
b, m, d            51.06    3    .000      137.88    (108.37, 175.43)
b ∗ m, d           50.20    2    .000      129.94    (92.99, 181.58)
b ∗ d, m           41.32    2    .000      205.31    (148.98, 282.92)
d ∗ m, b (†)        3.87    2    .144      104.93    (81.47, 135.14)      53.67
b ∗ m, b ∗ d       32.21    1    .000      735.02    (257.17, 2100.79)
d ∗ m, b ∗ d        .003    1    .956      132.32    (94.41, 185.44)      91.03
b ∗ m, d ∗ m        .64     1    .424       85.23    (60.19, 120.70)      60.51


That is, each mean m_{bmd} in this model is a product of terms corresponding to each of the three lists.

The second, third, and fourth rows summarize models with one pairwise interaction, including Equation 6.1:

log m_{bmd} = µ + α_b + β_m + γ_d + λ_{dm}   (6.1)

Equation 6.1 specifies an interaction between the medical records and death certificates. In this model, names on the death and medical lists are independent of the list of names on the birth certificates. The shorthand for Equation 6.1 is [b, d ∗ m]. Output 6.1 contains some of the output for this model.

Output 6.1 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 2 3.8599 1.9299

Scaled Deviance 2 3.8599 1.9299

Pearson Chi-Square 2 3.8690 1.9345

Scaled Pearson X2 2 3.8690 1.9345

Log Likelihood 2436.8122

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter DF Estimate Standard Error Wald 95% Confidence Limits Chi-Square Pr > ChiSq

Intercept 1 4.6533 0.1291 4.4003 4.9062 1299.89 <.0001

medic 1 -0.7159 0.1048 -0.9213 -0.5105 46.67 <.0001

death 1 -0.6112 0.1020 -0.8111 -0.4112 35.90 <.0001

birth 1 0.8561 0.1123 0.6360 1.0762 58.13 <.0001

medic*death 1 -1.7638 0.2806 -2.3137 -1.2138 39.52 <.0001

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

The last three rows summarize log-linear models with two pairwise interactions. Every model has a chi-squared goodness of fit, df, and corresponding p-value. Table 6.2 also gives point estimates and confidence intervals for the means of the number of infants in the unobserved category.

The parameter modeling the interaction between death and medical lists (d ∗ m) appears to be very important in Table 6.2. All log-linear models without this interaction fit the data poorly. All three of the models that contain it fit the data adequately.


Every model in Table 6.2 has an estimated intercept, which is the log of the estimate for the count in the missing cell. Both the intercept and its standard error are produced by GENMOD in the output section labeled Analysis of Parameter Estimates. The point estimate for the number of missing infants in each model is

exp(intercept).

For Equation 6.1, this estimate is

exp(4.6533) = 104.93.

The approximate 95% confidence interval associated with this estimate is obtained as

exp(intercept ± 1.96 std)

or, equivalently, e raised to the lower and upper confidence limits of the intercept given by GENMOD.

In the case of Equation 6.1 this confidence interval is

exp(4.6533 + 1.96 ∗ 0.1291) = exp(4.9062) = 135.14

and

exp(4.6533 − 1.96 ∗ 0.1291) = exp(4.4003) = 81.47.

These confidence intervals for the mean of the missing count are given in Table 6.2 for each of the fitted log-linear models.
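The back-transformation is easy to reproduce outside SAS. This Python sketch (illustrative only, using the intercept 4.6533 and standard error 0.1291 reported in Output 6.1) recovers the point estimate and interval for Equation 6.1:

```python
from math import exp

# Intercept and standard error for Equation 6.1, taken from Output 6.1.
intercept, se = 4.6533, 0.1291

point = exp(intercept)              # estimated count in the missing cell
lower = exp(intercept - 1.96 * se)  # lower 95% confidence limit
upper = exp(intercept + 1.96 * se)  # upper 95% confidence limit

print(round(point, 2))  # 104.93
print(round(lower, 2))  # 81.47
print(round(upper, 2))  # 135.14
```

The same exponentiation applies to the intercept of every model in Table 6.2.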

Of the three adequately fitting models in Table 6.2, the 2 df Equation 6.1 has the narrowest 95% confidence interval. The two well-fitting 1 df models have wider confidence intervals for the estimated mean of the unobservable count. They are wider even though their fit is much better in terms of the chi-squared statistics.

This is often the conclusion reached when you are using a log-linear model with an adequate, but not necessarily the best, fit over all choices. The lesson to be learned is that simpler models usually produce smaller confidence intervals, even if they are not necessarily the best fitting. This point is demonstrated again in the example of the following section.

To summarize, Equation 6.1 provides a good fit with a small number of parameters. Names that appear on the death and medical records are closely related, but the names on the birth certificates appear independently of the other two lists. Using Equation 6.1, you can estimate the number of missing names in Table 6.1 to be about 105, with a 95% confidence interval of (81, 135). The two other adequately fitting models summarized in Table 6.2 have other interpretations and estimates for the number of missing infants. All three of these confidence intervals, however, have some overlap, showing that the estimates are not far from each other.

6.3 A Larger Number of Lists

Table 6.3 summarizes frequencies of birth defects with an unobservable category. In this example there are five sources for lists of names of children who are born with a specific congenital deformity. These five lists are as follows:

1. Hospital records

2. Obstetric records

3. School records


4. Department of Mental Health records

5. Department of Health records.

TABLE 6.3 Number of infants born with a specific congenital deformity according to five different sources of names. The count in the category indicated by ?? is not observed. Source: Bishop, Fienberg, and Holland (1975, p. 253); Wittes (1970).

                          List 1:   Yes   Yes    No    No
                          List 2:   Yes    No   Yes    No
List 3   List 4   List 5
Yes      Yes      Yes                2     0     3     0
Yes      Yes      No                 8    23     5    30
Yes      No       Yes                2     0     5     3
Yes      No       No                18    34    36    83
No       Yes      Yes                5     3     1     2
No       Yes      No                25    37    22    97
No       No       Yes                1     1     4     4
No       No       No                19    37    27    ??

Used with permission: MIT Press.

Each of these five sets of records is incomplete to some degree and is correlated with each of the other four. As in the previous section, the aim is to estimate the number of children whose names do not appear on any of these five lists. In the following examination, the five lists are referenced by their numbers instead of their names.

A useful way to proceed with high-dimensional data such as this is to have GENMOD build all log-linear models with specified orders of interactions. Such a summary is given in Table 6.4. The first row of Table 6.4 summarizes a model with only an intercept. The second row summarizes a model with mutual independence among all five lists. The third row summarizes a model with all pairwise interactions, obtained using the @2 option in the GENMOD MODEL statement. The SAS code to fit this model is

proc genmod;
   model freq = list1 | list2 | list3 | list4 | list5 @2
         / dist = Poisson;
run;

The same MODEL statement using @3 fits a model with all possible three-way interactions. The presence of a missing count and several zero counts prevents you from fitting a log-linear model with four-way and five-way interactions.

TABLE 6.4 Goodness of fit for log-linear models containing all interactions of the specified orders. Estimates are given for the missing category count, 95% confidence interval, and the width of these intervals.

                                                   Estimated missing count
Order   df      χ²   p-value      G²   p-value    Point    95% C.I.         C.I. width
0       30   943.90            761.18
1       25    99.60  < 10⁻⁴     93.45  < 10⁻⁴     101.5    (82.6, 124.7)      42.0
2       15    19.56    .19      22.27    .10      114.4    (71.4, 183.5)     112.1
3        5     6.24    .28       6.89    .23       95.6    (25.0, 366.0)     341.0


Each row in Table 6.4 has chi-squared and deviance goodness-of-fit statistics. The useful feature of a table such as this is to identify the highest level of interactions you need to include in a well-fitting model. The last two lines of this table identify well-fitting models. The aim is always to find a well-fitting model with the smallest number of parameters. In the previous section you saw that a well-fitting model with the smallest number of parameters (and larger df) provides the narrowest estimated confidence interval.

In the present example, such a model lies in between the model of independence (the Order 1 model in Table 6.4) and the model with all pairwise interactions (the Order 2 model). This fitting process ends with a log-linear model for this data that contains all of the main effects for each of the five lists of names and a subset of their pairwise interactions.

Table 6.4 also gives estimates and confidence intervals for the frequency in the unobservable category of children. These estimates are obtained in the same manner as those in Section 6.2. There is a fairly close agreement in these estimates: they range between 96 and 114 and have a large overlap. The confidence intervals in Table 6.4 increase in width with the number of parameters in the models. That is, saturated models tend to increase the estimated standard errors. This point is also made in the previous section.

The output of the model that fits all pairwise interactions between the five lists is given in Output 6.2. From this output, you can see the relative importance of each of the (5 choose 2) = 10 pairs of interaction effects generated using the @2 option. These are, in decreasing order of their respective chi-squared statistics: lists 2 and 5; 3 and 4; 1 and 2; 1 and 4; and so on. A useful way to proceed is to add these interactions one at a time to the model of independence and examine the fit at each point. This forward stepwise selection process is summarized in Table 6.5.

TABLE 6.5 Goodness of fit for log-linear models containing specified effects and interactions.

                                                  Estimated missing count
Terms in model                df      χ²     p      G²     p     Point   95% CI           CI width
[1, 2, 3, 4, 5]               25   99.60   .00   93.45   .00    101.5   (82.6, 124.7)      42.0
[1, 3, 4, 2 ∗ 5]              24   69.83   .00   72.77   .00    108.0   (87.8, 132.9)      45.1
[1, 2 ∗ 5, 3 ∗ 4]             23   35.21   .05   39.19   .02     72.9   (56.2, 94.5)       38.8
[1 ∗ 2, 2 ∗ 5, 3 ∗ 4]         22   28.20   .17   30.56   .11     83.5   (63.5, 109.8)      46.4
[1 ∗ 2, 1 ∗ 4, 2 ∗ 5, 3 ∗ 4]  21   23.50   .32   25.81   .21     97.5   (71.6, 132.8)      61.2

There is also a backward (elimination) stepwise procedure that begins with all pairwise interactions of Output 6.2 and then removes the least statistically significant of these, one at a time. In the present example, the forward and backward stepwise methods arrive at the same final model.


Output 6.2 The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 15 22.2746 1.4850

Scaled Deviance 15 22.2746 1.4850

Pearson Chi-Square 15 19.5555 1.3037

Scaled Pearson X2 15 19.5555 1.3037

Log Likelihood 1363.9839

The GENMOD Procedure

Analysis Of Parameter Estimates

Parameter DF Estimate Standard Error Wald 95% Confidence Limits Chi-Square Pr > ChiSq

Intercept 1 4.7400 0.2409 4.2679 5.2122 387.11 <.0001

list1 1 -1.2008 0.2239 -1.6397 -0.7620 28.76 <.0001

list2 1 -1.3138 0.2278 -1.7603 -0.8672 33.25 <.0001

list1*list2 1 0.6617 0.1991 0.2716 1.0519 11.05 0.0009

list3 1 -0.2569 0.2313 -0.7103 0.1964 1.23 0.2667

list1*list3 1 0.1854 0.2044 -0.2152 0.5860 0.82 0.3644

list2*list3 1 0.2165 0.2136 -0.2022 0.6351 1.03 0.3109

list4 1 -0.2203 0.2316 -0.6742 0.2337 0.90 0.3416

list1*list4 1 0.5236 0.2080 0.1159 0.9314 6.33 0.0118

list2*list4 1 -0.1444 0.2114 -0.5587 0.2698 0.47 0.4944

list3*list4 1 -0.8787 0.2118 -1.2938 -0.4637 17.22 <.0001

list5 1 -3.6889 0.3886 -4.4505 -2.9274 90.13 <.0001

list1*list5 1 -0.0031 0.3666 -0.7216 0.7153 0.00 0.9932

list2*list5 1 1.6186 0.3679 0.8975 2.3396 19.36 <.0001

list3*list5 1 0.0820 0.3657 -0.6347 0.7986 0.05 0.8226

list4*list5 1 0.2365 0.3656 -0.4799 0.9530 0.42 0.5176

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.


The forward stepwise process generates a lot of SAS output, which is summarized in Table 6.5. This table gives a list of models in order of increasing complexity. Each model is associated with goodness-of-fit measures and estimates for the mean of the unobservable category of children. The sequence of models described in this table provides increasing improvements in the goodness of fit. The last two models in Table 6.5 are adequate and provide a reasonable measure of fit to the data.

Confidence intervals are also given in Table 6.5 and, for comparison, the widths of these intervals are also given. With the exception of the [1, 2 ∗ 5, 3 ∗ 4] model, corresponding to

log m_{12345} = µ + α_1 + β_2 + γ_3 + δ_4 + ε_5 + λ_{25} + φ_{34},   (6.2)

the widths of the confidence intervals increase with the number of terms in the model. Equation 6.2 and the [1 ∗ 2, 2 ∗ 5, 3 ∗ 4] model, corresponding to

log m_{12345} = µ + α_1 + β_2 + γ_3 + δ_4 + ε_5 + λ_{25} + φ_{34} + ω_{12},   (6.3)

have rather small estimates for the number of children missing from the data.

Goodness-of-fit and summary statistics for the log-linear models 6.2 and 6.3 are given on lines 3 and 4 of Table 6.5. Both of these models provide a good fit to the data. The model in Equation 6.2 provides the narrower confidence interval for the mean of the missing category of these two log-linear models, and has one fewer parameter than does the model given by Equation 6.3. The model in Equation 6.3 estimates the unobservable count in Table 6.3 to be 83.5, with a 95% confidence interval of 63.5 to 110. The last model summarized in Table 6.5, with 4 interaction terms, offers the best fit but also has the largest confidence interval.


Chapter 7
Truncated Poisson Regression

7.1 Introduction
7.2 Mathematical Background
7.3 Truncated Poisson Models with Covariates
7.4 An Example with Overdispersion
7.5 Diagnostics and Options

7.1 Introduction

As in the previous chapter, the methods in this chapter are motivated by wildlife surveys and epidemiologic studies. Many of the applications for these methods are for problems that involve estimating the size of a finite population. Chapter 6 covers a problem in which every search for a list of names is different and possibly interrelated with other lists. Log-linear models of the interactions between these lists are fit in order to estimate the number of counts in the missing category. This chapter covers settings in which there are a large number of surveys that can be treated as identical as well as independent. The difference between these methods and those of Chapter 6 is that here you know how many lists each individual belongs to, but not which lists.

The number of times an individual is recorded follows an approximate Poisson distribution. The difference between the Poisson model here and the one applied in Chapter 5 is that only a portion of the data is observable here. Specifically, you cannot observe the frequency of individuals who are never seen and who fail to appear in any of the surveys. That is, for the applications in this chapter, the frequency of the zero category in the Poisson distribution is not observable. The goal of the examples of Sections 7.3 and 7.4 is to estimate the frequency of this unobserved category. The lottery example of Section 7.5 does not seek to estimate the zero category of the Poisson distribution, but rather identifies a set of useful covariates and their effects, much like the parallel problem in multiple linear regression.

The data given in Table 7.1 is an example of the form taken by data with an unobservable zero category. A wildlife study was conducted to estimate the number of skunks in a region. The frequencies in this table summarize the history of skunks captured and released after several trappings. Each trapping episode provides a list of individuals, just like the lists of the previous chapter. The form of the data in Table 7.1 records the number of times each individual skunk is observed. For example, in 1977, after six separate trappings, one female was caught five times, two females were caught four times, four were caught three times, and so on. Unlike the surveys examined in the previous chapter, no record is kept of which skunks are observed in the first and subsequent waves of captures. Rather, Table 7.1 only records the number of times each skunk is seen, provided it has been observed at least once. Skunks never captured, of course, are not included in the data.

One aim of examining this data is to estimate the number of skunks missing from any of these trappings and never counted. This is not the only goal, however. It is also useful to


TABLE 7.1 Striped skunk recapture frequencies by year and sex. Source: Greenwood et al. (1985).

Capture              1977                1978
frequencies     Female    Male     Female    Male
1                  1        3         7        4
2                  2        0         7        3
3                  4        3         3        1
4                  2        3         1
5                  1        2         2
6                           2

Mean             3.00     3.58      2.20     1.63

Used with permission: Journal of Wildlife Management.

determine any frequency differences between the two years and sexes, and whether there is any evidence of a year by sex interaction.

The assumption with the application of the Poisson model to this data is that there is a constant effort expended by those performing the trapping. That is, a skunk captured only once is just as likely to have been observed at the beginning as at the end of the experiment. You also need to assume that once an individual has been captured and observed, this does not increase or decrease the likelihood of any subsequent recapture. There are alternative models if these assumptions are invalid, but those models are beyond the scope of the present discussion.

How many skunks of each sex were never captured in each of these two years? Are there significant year and/or sex differences? What is the estimated proportion of skunks that were never observed? These questions are answered using the truncated Poisson distribution developed in this chapter. The models require some mathematical and programming effort in order to get GENMOD to fit them correctly. The fact that this can be done illustrates the flexibility that GENMOD offers.

7.2 Mathematical Background

This section develops the zero-truncated Poisson distribution and describes some of its elementary properties. The formulas given here are not crucial for understanding this distribution and can be omitted without loss of continuity. They are needed, however, for fitting the models in GENMOD. If you want to learn more about this distribution, refer to Johnson, Kotz, and Kemp (1992, pp. 181–4).

Program 7.1 provides the SAS macros (%TPR and %INV) needed to perform truncated Poisson regression in the GENMOD procedure. After these macros are created, it is only necessary to submit SAS code of the form

proc genmod;
   %tpr;
   model count = sex year year*sex;

run;

in order to fit log-linear models with the truncated Poisson distribution. The remainder of this section provides the mathematical background needed to understand the details of the truncated Poisson distribution and the %TPR macro.

The following program creates the %TPR and %INV macros for fitting the truncated Poisson distribution in GENMOD. Programs 7.2 and 7.3 illustrate the use of these macros.


Program 7.1
%macro tpr;                         /* Truncated Poisson regression */
   y = _RESP_;                      /* Observed count */
   lambda = exp(_XBETA_);           /* Truncated Poisson parameter */
   eml = exp(-lambda);              /* e to the minus lambda */

   /* Truncated Poisson deviance */
   dev = 0;                         /* Deviance for y=1 */
   if y > 1 then do;
      %inv(y);                      /* lam is MLE for y > 1 */
      dev = -lam + y*log(lam) - log(1-exp(-lam));   /* deviance */
   end;
   dev = dev + lambda - y*log(lambda) + log(1-eml);
   deviance d = 2*dev;              /* times 2 for chi-squared distribution */

   /* Provide GENMOD with link functions and variance */
   %inv(_MEAN_);                    /* Find lambda for _MEAN_ */
   fwdlink ey = log(lam);           /* Link function: log lambda */
   invlink linv = lambda / (1-eml); /* Truncated Poisson mean */

   /* Truncated Poisson variance */
   variance var = (lambda*(lambda+1)*(1-eml) - lambda**2) / (1-eml)**2;
%mend tpr;

%macro inv(ev);
   /* Interval bisection to find lambda from the expected value (ev) */
   if &ev LE 1 then                 /* expected value must be > 1 */
      lam = . ;                     /* lambda is not defined for ev LE 1 */
   else do;                         /* otherwise iterate to find lambda */
      lamlo = &ev - 1;              /* lambda is between exp value less one */
      lamhi = &ev;                  /* . . . and the expected value */
      do until (abs(lamhi-lamlo) < 1e-7);   /* convergence criteria */
         lam = (lamhi+lamlo)/2;     /* examine midpoint */
         mal = lam/(1-exp(-lam));   /* mean at midpoint */
         if mal GE &ev then lamhi = lam;    /* lower upper endpoint */
         if mal LE &ev then lamlo = lam;    /* raise lower endpoint */
      end;
   end;
%mend inv;

Recall from Section 1.3 that the Poisson distribution with mean λ > 0 has probability mass function

P[Y = y] = e^(−λ) λ^y / y!

for y = 0, 1, . . . .

Chapter 5 describes Poisson regression models of the parameter λ as a function of covariates. In the case of the data of Table 7.1, for example, you need to estimate the effects of sex and year differences. The GENMOD program in the following section fits log-linear models of the form

log(λ) = α + β_1 X_1 + β_2 X_2 + · · ·

where the X_j are covariate values (such as indicator variables for the year and sex in the skunk data) and (α, β_1, β_2, . . .) are regression coefficients that are estimated in GENMOD using maximum likelihood. The log (base e) of λ is the most commonly used link function in Poisson regression.

The truncated Poisson distribution arises when the zero counts in the familiar Poisson distribution are not observable. For example, the number of times every skunk is trapped follows a Poisson distribution, but each animal must be observed at least once in order to be recorded. The probability of a zero count in the Poisson distribution is e^(−λ), so the non-zero, observable data represents the 1 − e^(−λ) fraction of the whole population.

The truncated Poisson distribution has probability mass function

P[Y = y | Y > 0] = e^(−λ) λ^y / [y! (1 − e^(−λ))]   (7.1)

defined for y = 1, 2, . . . and parameter λ > 0.

The 1 − e^(−λ) term in the denominator of Equation 7.1 is the non-zero fraction of the population that can be observed. The unobservable skunks of Table 7.1 are not included in Equation 7.1.

The expected value of the discrete distribution in Equation 7.1 is

µ = µ(λ) = λ/(1 − e^(−λ)),   (7.2)

and is always larger than λ because values of y = 0 are excluded. Similarly, µ(λ) is always greater than 1 for all values of λ. In order to program the truncated Poisson distribution at Equation 7.1, the function µ(λ) must be specified as the inverse link (INVLINK) in GENMOD. The function µ(λ) in Equation 7.2 is plotted in Figure 7.1. When λ is large, the dashed line shows that λ and µ(λ) are almost equal.

Figure 7.1
The expected value of the truncated Poisson distribution µ(λ) is given at Equation 7.2.
[Plot of µ(λ) against λ, both axes running from 0 past 4; a dashed 45° reference line shows that µ(λ) and λ are almost equal when λ is large.]

Suppose y1, . . . , yn are a sample from the truncated Poisson distribution (Equation 7.1), all with the same value of λ. Let ȳ denote the sample average. The maximum likelihood estimate λ̂ = λ̂(ȳ) of λ is the value that solves the equation

ȳ = µ(λ̂) = λ̂/(1 − e^(−λ̂)).   (7.3)

Equation 7.3 does not have a simple solution, but it can be solved iteratively for λ using the %INV macro given in Program 7.1 and described below. The relation at Equation 7.3 specifies the intuitive property that the average of the observed data must equal the expected value of the fitted distribution. Equation 7.3 does not hold if the yi have different values of λ. The following section uses Equation 7.3 to verify that Output 7.1 correctly fits the truncated Poisson distribution to the skunk data.

In addition to the inverse link, the forward link (FWDLINK) is also needed in the GENMOD program. This requires that λ be specified as a function of the mean, µ. The forward link function does not have a simple expression, and it can also be computed iteratively using the %INV macro of Program 7.1. This macro finds the value of λ for any specified µ > 1 using interval bisection.

The initial interval for λ is defined by the inequalities

µ − 1 ≤ λ(µ) ≤ µ

where µ = µ(λ) is given at Equation 7.2. The algorithm in the %INV macro in Program 7.1 finds the midpoint of this interval and determines whether λ is in the upper or lower half of the interval. The interval containing the correct value of λ is then half as large. The algorithm continues in this fashion until λ is determined to lie in a suitably small interval.
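The same interval-bisection logic can be sketched in Python (illustrative only; the function name inv and the tolerance are choices made here, mirroring the %INV macro):

```python
import math

def inv(ev, tol=1e-7):
    """Invert mu(lam) = lam/(1 - exp(-lam)) by interval bisection.

    Mirrors the %INV macro: for ev > 1 the solution lies between
    ev - 1 and ev; for ev <= 1 no solution exists (returns None).
    """
    if ev <= 1:
        return None
    lamlo, lamhi = ev - 1.0, ev
    while abs(lamhi - lamlo) >= tol:
        lam = (lamhi + lamlo) / 2.0          # examine the midpoint
        mal = lam / (1.0 - math.exp(-lam))   # mean at the midpoint
        if mal >= ev:
            lamhi = lam                      # lower the upper endpoint
        if mal <= ev:
            lamlo = lam                      # raise the lower endpoint
    return (lamhi + lamlo) / 2.0

lam3 = inv(3.0)   # the lambda whose truncated-Poisson mean is 3
```

For a mean of 3.0 this returns a value near 2.8214, the λ estimate the skunk example arrives at later in the chapter.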

The variance of the truncated Poisson distribution

Var Y = {λ(λ + 1)(1 − e^(−λ)) − λ²}/(1 − e^(−λ))²   (7.4)

is needed in the VARIANCE statement of the GENMOD procedure to fit this distribution.

This is all of the mathematical background that is needed by GENMOD to fit these models. The %TPR macro, which incorporates these formulas for use in a GENMOD program, is given in Program 7.1. Its use is illustrated in Programs 7.2 and 7.3. The remainder of this section describes useful results that can be calculated from the GENMOD output with the use of the OBSTATS option.
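As a numerical sanity check on Equation 7.4 (again an illustration outside SAS), the closed-form variance can be compared with the variance computed directly from the probabilities of Equation 7.1:

```python
import math

def trunc_var(lam):
    """Truncated Poisson variance, Equation 7.4."""
    eml = math.exp(-lam)
    return (lam * (lam + 1.0) * (1.0 - eml) - lam**2) / (1.0 - eml) ** 2

def trunc_var_numeric(lam, ymax=200):
    """The same variance computed directly from the pmf of Equation 7.1."""
    p = lam * math.exp(-lam) / (1.0 - math.exp(-lam))   # P[Y = 1 | Y > 0]
    probs = []
    for y in range(1, ymax + 1):
        probs.append((y, p))
        p *= lam / (y + 1)    # Poisson recurrence: P(y+1) = P(y) * lam/(y+1)
    mean = sum(y * q for y, q in probs)
    return sum((y - mean) ** 2 * q for y, q in probs)

diff = abs(trunc_var(2.0) - trunc_var_numeric(2.0))   # agrees to machine precision
```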

The following program fits a truncated Poisson regression model to the skunk data using the %TPR macro of Program 7.1. Portions of the output from this program appear in Outputs 7.1 and 7.2.

Program 7.2

data skunk;
   input count freq year sex $ @@;
   yr=year; sx=sex;        /* make copies of YEAR and SEX variables */
   label
      count = 'count'
      freq  = 'frequency'
      year  = '1977 or 1978'
      sex   = 'F M';
datalines;
1 1 77 F   2 2 77 F   3 4 77 F   4 2 77 F   5 1 77 F
1 3 77 M   2 0 77 M   3 3 77 M   4 3 77 M   5 2 77 M
6 2 77 M   1 7 78 F   2 7 78 F   3 3 78 F   4 1 78 F
5 2 78 F   1 4 78 M   2 3 78 M   3 1 78 M
run;
proc genmod;
   %tpr;                           /* invoke the %TPR macro */
   frequency freq;
   class sex year;
   model count = sex year year*sex / obstats type1 type3;
   ods listing exclude obstats;    /* omit obstats output */
   ods output obstats=fitted;      /* create fitted value data set */
run;
data fit2;            /* omit class variables because these are */
   set fitted;        /* defined in two different ways */
   drop sex year;
run;
data both;                         /* combine two data sets */
   merge skunk fit2;               /* merge fitted & original values */
   lambda=exp(xbeta);              /* estimate of lambda parameter */
   p0 = exp(-lambda);              /* est of probability of zero category */
   lamup  = exp(xbeta + 1.96*std); /* 95% CI for lambda */
   lamlow = exp(xbeta - 1.96*std);
   p0up  = exp(-lamlow);           /* 95% CI for p0 */
   p0low = exp(-lamup);
   se=std*lambda;                  /* SE of lambda from the delta method */
   sep0=std*lambda*p0;             /* SE of p0 from the delta method */
run;
data;                              /* remove duplicated values */
   set both;
   by yr sx;
   if not ( first.yr | first.sx ) then delete;
run;
proc print noobs;
   var yr sx lambda se lamlow lamup p0 sep0 p0low p0up;
run;

The fitted linear estimator of log(λ) is denoted by XBETA and is equal to

xbeta = log(λ) = α + β1 X1 + β2 X2 + · · ·

for every observation. The values of (α, β1, β2, . . .) used here are the maximum likelihood estimates of the regression coefficients obtained by GENMOD. The Type 1 and Type 3 likelihood ratio statistics can be requested to test the statistical significance of each of the regression coefficients β1, β2, . . . . These different statistical significance levels are explained in Section 2.5.2.

The estimated standard error of XBETA is denoted by STD in the OBSTATS output of the GENMOD procedure. The values of XBETA and STD are used to estimate the truncated Poisson parameter λ for every combination of covariates such as year and sex. Confidence intervals are easily obtained as well. Specifically,

λ = exp(xbeta)

is the estimated value of λ for any given set of covariate values X1, X2, . . . .

Approximate 95% confidence intervals for λ are given by

λU = exp(xbeta + 1.96 std)

and

λL = exp(xbeta − 1.96 std)

as the upper and lower endpoints, respectively.

An estimate of the standard error of λ can be obtained from the well-known delta method. This estimate is

Standard error of λ = λ × std,

and is computed as the variable SE in Program 7.2.

The estimate λ and its confidence interval can be used to estimate the proportion of individuals not observed using the fitted truncated Poisson distribution. The point estimate

p0 = exp(−λ)


is the estimated proportion of observations equal to zero in the full (non-truncated) Poisson distribution and is unobservable in the truncated distribution.

An approximate 95% confidence interval for p0 is

pU = exp(−λL)

and

pL = exp(−λU)

as the upper and lower endpoints.

The delta method provides an estimate of the standard error for p0 as

Standard error of p0 = p0 × λ × std.

These values are denoted by SEP0 in Program 7.2.
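To see these formulas in action, the 1977-female row of Output 7.2 can be reproduced by hand. In this Python sketch, XBETA and STD are back-calculated from the printed output, so the results agree with Output 7.2 only to the printed rounding:

```python
import math

# OBSTATS values for the 1977 females, back-calculated from Output 7.2:
# xbeta is log(lambda) and std follows from SE(lambda) = lambda * std
xbeta = math.log(2.82144)
std   = 0.49642 / 2.82144

lam    = math.exp(xbeta)                # estimated lambda = 2.82144
se     = lam * std                      # delta-method SE of lambda
lamup  = math.exp(xbeta + 1.96 * std)   # upper 95% limit for lambda
lamlow = math.exp(xbeta - 1.96 * std)   # lower 95% limit for lambda
p0     = math.exp(-lam)                 # estimated unobserved fraction
sep0   = p0 * lam * std                 # delta-method SE of p0
p0up   = math.exp(-lamlow)              # upper 95% limit for p0
p0low  = math.exp(-lamup)               # lower 95% limit for p0
```

These values match the 77/F line of the printed listing: lamlow ≈ 1.998, lamup ≈ 3.983, p0 ≈ 0.0595, sep0 ≈ 0.0295.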

7.3 Truncated Poisson Models with Covariates

Program 7.2 estimates regression coefficients and tests whether they are zero. Then it produces estimates of the number of unobserved skunks. Output 7.1 summarizes the fitted parameter estimates.

Output 7.1

The GENMOD Procedure

Analysis Of Parameter Estimates

                                 Standard       Wald 95%
Parameter        DF   Estimate   Error          Confidence Limits    Chi-Square   Pr > ChiSq
Intercept         1     0.0626    0.2807       -0.4875     0.6127        0.05       0.8236
sex        F      1     0.5560    0.3158       -0.0630     1.1749        3.10       0.0783
sex        M      0     0.0000    0.0000        0.0000     0.0000         .          .
year       77     1     1.1679    0.3152        0.5502     1.7857       13.73       0.0002
year       78     0     0.0000    0.0000        0.0000     0.0000         .          .
sex*year   F 77   1    -0.7492    0.3889       -1.5115     0.0130        3.71       0.0540
sex*year   F 78   0     0.0000    0.0000        0.0000     0.0000         .          .
sex*year   M 77   0     0.0000    0.0000        0.0000     0.0000         .          .
sex*year   M 78   0     0.0000    0.0000        0.0000     0.0000         .          .
Scale             0     1.0000    0.0000        1.0000     1.0000

NOTE: The scale parameter was held fixed.


Program 7.2 also estimates one truncated Poisson parameter λ for each of the four year and sex combinations. These are given at the bottom of Output 7.2. Each parameter λ is estimated along with a corresponding approximate 95% confidence interval. There are also four estimated probabilities (p0) of skunks never observed. These probabilities are given along with their estimated 95% confidence intervals.

Output 7.2

The GENMOD Procedure

LR Statistics For Type 1 Analysis

Source      Deviance   DF   Chi-Square   Pr > ChiSq
Intercept    67.8349
sex          67.1047    1      0.73        0.3928
year         57.4905    1      9.61        0.0019
sex*year     55.1902    1      2.30        0.1293

LR Statistics For Type 3 Analysis

Source      DF   Chi-Square   Pr > ChiSq
sex          1      0.51        0.4762
year         1     11.16        0.0008
sex*year     1      2.30        0.1293

yr   sx    lambda       se       lamlow     lamup      p0        sep0      p0low     p0up
77   F     2.82144   0.49642    1.99849   3.98326   0.05952   0.02955   0.01862   0.13554
77   M     3.42306   0.49069    2.58460   4.53353   0.03261   0.01600   0.01074   0.07543
78   F     1.85622   0.26867    1.39773   2.46511   0.15626   0.04198   0.08500   0.24716
78   M     1.06459   0.29882    0.61412   1.84547   0.34487   0.10305   0.15795   0.54111

A summary of the parameter estimates in Outputs 7.1 and 7.2 shows that there are large differences between the two years but only a small sex difference. The effects of sex and the year-by-sex interaction are not statistically significant in the Type 1 or Type 3 analyses. Confidence intervals of the four estimated λ parameters are given in Output 7.2 for all four combinations of year and sex.

The estimated unobservable fractions (p0) are given along with their 95% confidence intervals in Output 7.2 for each of the four sex and year combinations. The 1977 census seems to be more complete, with smaller estimated proportions of missing skunks p0 than 1978. The estimated rates for missing animals in 1977 are 3% for male and 6% for female skunks. The estimates indicate that in 1978, 16% of the female and 34% of the male skunks were never observed.


To verify that GENMOD is correctly fitting the truncated Poisson distribution, check that the estimates of λ from Output 7.2 satisfy Equation 7.3. For example, the 1977 female sample captured each individual skunk an average of ȳ = 3.0 times. You need to show that the estimate λ̂ = 2.82144 from Output 7.2 satisfies this relation. Specifically,

λ̂/(1 − e^(−λ̂)) = 2.82144/(1 − e^(−2.82144)) = 3.00 = ȳ

verifies that the model is correctly fitted by the GENMOD program. In other words, this equation shows that the expected value of the fitted distribution is the same as the observed average number of times each skunk is trapped, provided that the skunk is captured at least once.
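This check is easy to script. The 1977-female capture frequencies come from the DATALINES of Program 7.2 (one skunk caught once, two caught twice, four caught three times, two caught four times, one caught five times); the rest is Equation 7.3:

```python
import math

# Observed 1977-female capture frequencies (count of captures: number of skunks)
counts = {1: 1, 2: 2, 3: 4, 4: 2, 5: 1}
ybar_obs = sum(c * f for c, f in counts.items()) / sum(counts.values())  # = 3.0

lam_hat = 2.82144                                # fitted lambda from Output 7.2
ybar_fit = lam_hat / (1.0 - math.exp(-lam_hat))  # truncated-Poisson mean, Eq 7.2
# ybar_fit equals ybar_obs to the printed precision, confirming Equation 7.3
```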

7.4 An Example with Overdispersion

The truncated Poisson distribution is applied in this section to another data set in which you want to estimate the size of a closed population. The analysis of this data is similar to that of the wildlife data of the previous section. A major difference is that the distribution has very long tails. An overdispersion parameter is estimated following similar examples given in Sections 1.5 and 5.3.

The data of Table 7.2 is part of a larger study of the problem of illegal immigration in the Netherlands. In our unpublished abstract (van der Heijden et al. 1997) we report the frequencies of arrest records for illegal immigrants in the four largest cities of the Netherlands (Amsterdam, Rotterdam, The Hague, and Utrecht) for the year 1995. Arrests are usually the result of a crime or motor vehicle violation. Individuals might be arrested by the police and deported, only to reappear several weeks later and be arrested again. Others, once arrested, cannot be deported for a variety of reasons and are asked to leave the Netherlands voluntarily.

TABLE 7.2 Observed and expected re-arrest frequencies of illegal immigrants in the Netherlands. Individuals are classified by the effectiveness with which the authorities were able to deport them. Frequencies for 3 or more arrests were combined to calculate the chi-squared statistic. Source: van der Heijden et al. (1997).

             Not effectively         Effectively            Other and
             expelled                expelled               missing
Frequency    Obs        Exp          Obs       Exp          Obs       Exp
1            2755       2693         2431      2423         654       652
2             251        349           52        65          57        61
3              46         30            4         1           6         4
4              18          2           15         2          16         2

Totals       3074                    2489                   717
Means        1.135                   1.027                  1.096
χ² (1 df)    69.                     27.6                   1.27

Public domain: Material not copyright protected.

The process of deportation might or might not be effective because of a number of factors, such as the lack of cooperation of the nation of origin. The Schengen Treaty assures that the shared borders between a large part of the European Union are free and open. Immigrants from Germany, for example, are always allowed to live in the Netherlands and cannot be declared illegal. Consequently, many individuals deported to these countries can easily return. These open borders result in a classification of the effectiveness of expulsion of illegal immigrants.

The data of Table 7.2 contains a summary of the arrest and re-arrest frequencies. The problem of population size estimation corresponds to estimating the number of illegal immigrants who are potentially subject to arrest by the police. The numbers of such individuals can be estimated separately for each of the three categories. The mean number of arrests reflects the differences in expulsion categories. Those not effectively expelled have the highest mean number of re-arrests and those who were effectively expelled have the lowest rates of re-arrest. The means for those in the Other and missing category fall in between these two rates.

Table 7.3 gives the estimated values for the truncated Poisson parameter λ and the estimated proportion p0 who are in the zero frequency category and are unobservable. The corresponding 95% confidence intervals are also in Table 7.3. The estimated values of λ and p0 are obtained from a GENMOD program very similar to the one given in Program 7.2.

TABLE 7.3 Estimated parameters and their upper and lower 95% confidence intervals for the illegal immigrant data of Table 7.2. Small values of the c parameters correspond to high degrees of overdispersion, and c at +∞ coincides with a truncated Poisson model.

             Not effectively    Effectively    Other and
Parameter    expelled           expelled       missing
λ            0.259              0.053          0.187
λL           0.246              0.047          0.165
λU           0.272              0.060          0.211
p0           0.772              0.948          0.830
pL           0.762              0.942          0.810
pU           0.782              0.954          0.848
c            0.611              0.059          11.02
cL           0.0006             6 × 10⁻⁵       0.003
cU           3.32               0.373          +∞

Individuals who are classified as Effectively expelled represent a relatively small proportion of the observed data. The estimated parameter values of Table 7.3 indicate that about 95% of such individuals do not appear in the recorded data. In other words, the number of similarly classified individuals is about 19 times greater than the number that appear in Table 7.2. That is, these models estimate that there are about 50,000 individuals who could be classified under the Effectively expelled category, but who have not been observed.

Individuals classified as Not effectively expelled appear to be missing about 75% of the time. So about 3 times as many people are missing as are recorded in Table 7.2. The Other and missing category of immigrants appears to be similar to the Effectively expelled classification.
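The arithmetic behind these statements is the usual inflation of observed counts: if a fraction p0 of a population is never seen, the n observed individuals represent a 1 − p0 share of the total. A sketch using the totals of Table 7.2 and the rounded point estimates of Table 7.3 (so the results are approximate):

```python
# Observed totals (Table 7.2) and estimated unobserved fractions (Table 7.3)
observed = {"not_effective": 3074, "effective": 2489, "other": 717}
p0       = {"not_effective": 0.772, "effective": 0.948, "other": 0.830}

est_total  = {k: observed[k] / (1.0 - p0[k]) for k in observed}   # N-hat
est_unseen = {k: est_total[k] - observed[k] for k in observed}    # missing count
multiplier = {k: 1.0 / (1.0 - p0[k]) for k in observed}           # inflation factor
```

For the Effectively expelled group the multiplier is about 19, and the estimated number of unseen individuals is roughly 45,000, which (given the width of the p0 confidence interval) is consistent with the "about 50,000" quoted in the text.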

If we combine all frequencies of 3 or more, then the chi-squared goodness of fit test has 1 df. Specifically, categorical data with three categories loses 1 df for the sample size and another df for the fitted parameter λ. The fit is generally poor except in the Other and missing category, due to the extremely long tails of the observed distributions. Apparently a very small number of individuals are deported, only to be arrested again several more times. Similarly, most individuals arrested once are then likely to be more wary and successfully manage to evade authorities in the future.


This reasoning leads one to believe that there are unmeasured covariates and/or a large degree of heterogeneity in the data of Table 7.2. A discussion of the possibly unmeasured covariates appears at the end of this section. The lack of fit and large chi-squared values can be used to provide measures of overdispersion using the same methods described through the numerical examples in Sections 1.5 and 5.3.

It is not possible to simply fit a negative binomial model to this data by using DIST=NB in GENMOD. The negative binomial distribution has longer tails than the Poisson distribution and is only suitable if the zero frequencies are observable. The negative binomial distribution is defined on the non-negative counts 0, 1, . . . and is not appropriate for this data with unobservable counts. Instead you should use the approach for overdispersed data that is described in Section 1.5.

The sum of the combined chi-squared statistics for all three immigrant groups has 3 df. A 3 df chi-squared variate has a mean of 3 and a 95% confidence interval of (.216, 9.35). The estimated overdispersion parameter c in Equation 1.8 that matches χ²(c) with this mean is .172, and the corresponding 95% confidence interval for this parameter is (.011, .727). Smaller values of c are indicative of greater degrees of overdispersion, so these estimates indicate a large amount of heterogeneity in the data set.

Separate estimates of the overdispersion parameters c for each of the three populations are listed in Table 7.3. Very small estimates of c in the two expelled categories indicate that there is a large amount of heterogeneity in these populations. The large estimated value of c in the Other and missing category indicates that individuals in this group are similar to each other and their frequencies are adequately explained by the truncated Poisson distribution. Confidence intervals for the overdispersion parameters in each of the three categories are very wide. The infinite upper limit of the parameter for the Other and missing category of immigrants indicates that the chi-squared and the fitted model in Table 7.2 adequately explain the observed data.

The heterogeneity might be due to missing covariates such as the sex of the individual or whether the individual has a prior criminal record. It is often difficult to distinguish between overdispersion and a missing covariate in the data set. Lambert and Roeder (1995) develop a set of measures to distinguish between these situations.

In summary, a very large number of illegal immigrants are apparently missing from this data. Immigration officials were not surprised to read our report that the number of illegal immigrants might be much higher than the number of those individuals recorded in the arrest records of Table 7.2. The individuals who are arrested and deported are usually unemployed young men who are arrested for other types of criminal activity, chiefly motor vehicle violations, assault, drugs, and illegal weapons. These young men are greatly overrepresented in Table 7.2. The vast majority of illegal immigrants hold jobs or work as domestics in private homes and are not visible during much of the day. The survey might then be refined so that the target population includes only those individuals who are potentially subject to arrest by the police for criminal activity.

7.5 Diagnostics and Options

The following example does not try to estimate the size of a closed population. Instead it uses the truncated Poisson distribution to model data for which the Poisson distribution is applicable but the zero frequencies have been omitted. It uses GENMOD to fit a truncated Poisson distribution as a generalized linear model with several covariates. A number of useful diagnostic measures and plots are introduced as well.

The New Haven Register (Aug. 17, 1995) listed some of the towns near New Haven and the number of major lottery winners in each. The article did not give the demographic information (population) that is supplied in Table 7.4. That data is based on the 1980 census. The area of each town is measured in square miles. The mill rate determines how much tax is paid on commercial and residential real estate. The mill rate gives a rough indication of property values: lower rates generally appear in towns with higher property values.

Every town listed in Table 7.4 contains at least one winner. Towns without winners were excluded from the data given in the newspaper article. In this example there is little interest in estimating the size of the lottery-playing population. It is more interesting to describe the effects of the three covariates (population, area, mill rate) and how they are related to the lottery-playing habits of the people living in these towns. Another variable examined is the population density (population divided by area) as a measure of urban versus rural concentration of each town.

TABLE 7.4 Major lottery winners in towns near New Haven. Towns with no winners were omitted from the original newspaper article. Source: Zelterman (1999, pp. 38–40).

                Number        Population     Area in      Property tax
Town            of winners    in thousands   sq. miles    mill rate
Ansonia            6             17.9           6.2          28.9
Beacon Falls       3              5.3           9.8          25.0
Branford          11             28.0          27.9          22.6
Cheshire           6             26.2          33.0          27.1
Clinton            2             12.8          17.2          27.9
Derby              6             12.0           5.3          29.6
East Haven         9             26.5          12.6          37.1
Guilford           6             20.3          47.7          28.6
Hamden             9             52.0          33.0          34.1
Madison            5             16.0          36.3          22.3
Milford           10             49.5          23.5          30.8
N. Branford        2             13.1          26.8          26.9
North Haven       12             21.6          21.0          23.4
Old Saybrook       1              9.3          18.3          15.3
Orange             9             12.5          17.6          23.8
Oxford             3              9.1          33.0          29.0
Seymour            1             14.5          14.7          40.5
Shelton            7             36.0          31.4          21.6
Trumbull          14             33.0          23.5          24.1
West Haven        12             54.0          10.6          41.4
Woodbridge         1              8.0          19.3          28.4

The Poisson distribution is valid here because of the binomial limit in which N is large and the p parameter is small. The population of lottery players, or the number of tickets sold, is very large, but the probability of winning with each play is very small. The number of winners in each town should then have an approximately Poisson distribution, and the mean λi should be proportional to the ith town's population.
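This binomial-to-Poisson limit is easy to demonstrate numerically. The N and p below are made-up magnitudes chosen for illustration, not actual lottery figures:

```python
import math

def binom_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Large N, tiny p, with N*p = 3: the binomial is nearly Poisson(3)
n, p = 1_000_000, 3e-6
lam = n * p
max_gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(10))
```

The largest pointwise gap between the two distributions is on the order of N·p², which is negligible here.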

Figure 7.2 plots the number of winners in each town by the town's population. There is a clear trend of an increasing number of winners with increasing town population. The variability of the number of winners also increases with the mean. The truncated Poisson distribution variance given at Equation 7.4 increases with its mean. This is also the case with the usual (non-truncated) Poisson distribution.


Figure 7.2
The number of lottery winners plotted against town population. The variance increases with the mean.
[Scatter plot of Winners (0 to 15) against town population in thousands (20 to 60), one point per town.]

Program 7.3 fits a truncated Poisson regression model to this lottery data and uses the %TPR macro described in Program 7.1. This program fits a regression model for log λi as a linear function of the covariates: log population, area, mill rate, and population density. Each λi is the truncated Poisson parameter for the ith town. The relationship between the expected number of winners and λ is described in Section 7.3 and plotted in Figure 7.1.

Program 7.3

title1 'Truncated Poisson regression to model lottery winners';
data lottery;
   input town $ 1-14 winners popul area mill;
   dens = popul/area;
   logpop = log(popul);
   label
      popul  = 'population in 1000s'
      logpop = 'log population'
      area   = 'in square miles'
      mill   = 'property tax rate'
      dens   = 'population density';
datalines;
Ansonia        6 17.9  6.2 28.9
Beacon Falls   3  5.3  9.8 25.0

   . . . more data . . .

Woodbridge     1  8.0 19.3 28.4
run;
title2 'Fit log population as a covariate';
proc genmod;
   %tpr;                        /* invoke the %TPR macro */
   model winners = mill area dens logpop / type1 type3
      maxiter=100               /* increase number of iterations */
      intercept=.3;             /* initial starting value for intercept */
run;
title2 'Fit log population using offset';
proc genmod;
   %tpr;                        /* invoke the %TPR macro */
   model winners = mill area dens /
      offset=logpop obstats type1 type3
      maxiter=100               /* increase number of iterations */
      intercept=.3;             /* initial starting value for intercept */
   ods output obstats=fitted;   /* create fitted value data set */
   ods listing exclude obstats; /* do not print the obstats data */
run;
data both;
   merge lottery fitted;        /* merge the fitted and original values */
   lambda=exp(xbeta);           /* estimated lambda parameter */
run;
title3 'Plot of residuals from the fitted model';
proc gplot;
   bubble reschi * pred = dens;
run;

Two options are specified in Program 7.3 to help GENMOD converge to the correct solution of the regression coefficients. These are sometimes needed in complex models such as the one described here and the hypergeometric regression discussed in Chapter 8. One of these options is the MAXITER= option, which specifies the number of iterations that GENMOD is allowed to perform before quitting. With current fast computers, several thousand iterations should only take a second or two. The second option is INTERCEPT=, which specifies the approximate value of the intercept in the final model. Even a rough guess can greatly speed up the convergence of the GENMOD fitting algorithm. The choice of INTERCEPT=.3 was based on several trial-and-error runs of this program. The MAXITER= and INTERCEPT= options are also used in the programs of Section 8.4.

The first model fit by Program 7.3 is

log λi = α + β0 logpop + β1 area + β2 mill + β3 density

where LOGPOP is the logarithm of the population size. Some of the output from the program fitting this model appears in Output 7.3. In this model of log(λi) as a linear function of covariates, the fitted regression coefficient of log population is .957 (SE = .28), very nearly equal to one. This fitted model has a good fit (χ² = 22.16; 16 df; p = .14).


Output 7.3

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion            DF      Value      Value/DF
Deviance             16     27.8810      1.7426
Scaled Deviance      16     27.8810      1.7426
Pearson Chi-Square   16     22.1608      1.3850
Scaled Pearson X2    16     22.1608      1.3850
Log Likelihood             -13.9405

Analysis Of Parameter Estimates

                           Standard      Wald 95%
Parameter   DF   Estimate  Error         Confidence Limits    Chi-Square   Pr > ChiSq
Intercept    1     0.3281   0.6014      -0.8508     1.5069        0.30       0.5854
mill         1    -0.0418   0.0169      -0.0749    -0.0086        6.09       0.0136
area         1    -0.0148   0.0186      -0.0513     0.0217        0.63       0.4265
dens         1     0.0506   0.1696      -0.2819     0.3830        0.09       0.7657
logpop       1     0.9572   0.2788       0.4108     1.5036       11.79       0.0006
Scale        0     1.0000   0.0000       1.0000     1.0000

NOTE: The scale parameter was held fixed.

The same log-linear model with a restricted regression coefficient of 1 for the log population is summarized in Outputs 7.4 and 7.5. This restriction produces a negligible change in fit (χ² = 22.27; 17 df; p = .17). This is a clear indication that the λ parameters are proportional to the towns' populations. Program 7.3 uses the OFFSET= option in GENMOD to force each town's λi parameter to be proportional to that town's population. There is virtually no change in the goodness of fit between the models with and without the OFFSET= restriction, so this appears to be a reasonable savings of 1 df.


Output 7.4

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion            DF      Value      Value/DF
Deviance             17     27.8929      1.6408
Scaled Deviance      17     27.8929      1.6408
Pearson Chi-Square   17     22.2698      1.3100
Scaled Pearson X2    17     22.2698      1.3100
Log Likelihood             -13.9464

Analysis Of Parameter Estimates

                           Standard      Wald 95%
Parameter   DF   Estimate  Error         Confidence Limits    Chi-Square   Pr > ChiSq
Intercept    1     0.2622   0.5139      -0.7449     1.2694        0.26       0.6098
mill         1    -0.0416   0.0169      -0.0747    -0.0085        6.06       0.0138
area         1    -0.0168   0.0111      -0.0385     0.0049        2.29       0.1301
dens         1     0.0301   0.1012      -0.1682     0.2284        0.09       0.7659
Scale        0     1.0000   0.0000       1.0000     1.0000

NOTE: The scale parameter was held fixed.

Output 7.5

The GENMOD Procedure

LR Statistics For Type 1 Analysis

Source      Deviance   DF   Chi-Square   Pr > ChiSq
Intercept    35.7806
mill         31.6839    1      4.10        0.0430
area         27.9042    1      3.78        0.0519
dens         27.8929    1      0.01        0.9153

LR Statistics For Type 3 Analysis

Source   DF   Chi-Square   Pr > ChiSq
mill      1      4.25        0.0393
area      1      2.50        0.1139
dens      1      0.01        0.9120


The statement with the OFFSET= option fits the model

log λi = α + β0 logpop + β1 area + β2 mill + β3 density

with the coefficient β0 identically equal to 1. The truncated Poisson parameter λi for the ith town can then be written as

λi = Population × exp(α + β1 area + β2 mill + β3 density)

so that each λi is proportional to the town's population. Another example using the OFFSET= option appears in Section 5.2.
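As an illustration of the offset model, a town's fitted λi can be computed directly; the sketch below plugs Ansonia's covariates from Table 7.4 into the rounded coefficients of Output 7.4, so the result is approximate:

```python
import math

# Rounded coefficients from Output 7.4 (model with offset = log population)
alpha, b_mill, b_area, b_dens = 0.2622, -0.0416, -0.0168, 0.0301

# Ansonia (Table 7.4): population 17.9 thousand, area 6.2 sq mi, mill rate 28.9
popul, area, mill = 17.9, 6.2, 28.9
dens = popul / area

# lambda_i = population * exp(alpha + coefficients * covariates)
lam = popul * math.exp(alpha + b_mill * mill + b_area * area + b_dens * dens)
mu = lam / (1.0 - math.exp(-lam))   # expected number of winners, Equation 7.2
```

The fitted λ comes out close to 7, in the neighborhood of Ansonia's 6 observed winners; note also how little µ(λ) differs from λ at this size, as Figure 7.1 suggests.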

The mean number of winners is not λ, but rather the function given at Equation 7.2. This non-linear function makes it difficult to model the expected number of winners as exactly proportional to the population.

The expected number of winners should be proportional to the number of tickets sold, assuming, of course, that the chance of winning on each ticket is always the same. The data does not contain the number of tickets sold per town or per person, which might also be a better predictor of the number of winners than the population. Since you do not have this data, you should model the truncated Poisson parameter of winners per town as proportional to that town's population. This model is valid if you assume a constant rate of ticket sales per person across towns.

Outputs 7.4 and 7.5 give a summary of the GENMOD output from a fitted truncated Poisson regression model with the offset log population and the linear covariates: area, mill rate, and population density. The log population has an extremely significant effect in

Figure 7.3  Truncated Poisson chi-squared residuals plotted against predicted value. The sizes of the bubbles are proportional to the town’s population density.

[Figure 7.3: bubble plot of ResChi (y-axis, −2 to 2) against the predicted number of winners (x-axis); the towns Orange and West Haven are labeled.]


Output 7.3 (also visible in Figure 7.2) and is not mentioned in this output. The mill rate appears to be statistically significant in the Wald and both the Type 1 and Type 3 analyses. Higher mill rates are associated with fewer winners. The town’s area is only moderately statistically significant in the Type 1 analysis and not otherwise. The population density is not useful in explaining the number of lottery winners.

Figure 7.3 is a bubble plot produced by the GPLOT procedure. This figure plots the chi-squared residuals for the model against the predicted number of winners. The sizes of the bubbles are determined by the population density of each town. Larger bubbles represent higher population densities. There appears to be no lack of fit, and no large outliers are revealed. The largest absolute residual is 2.07 for the town of Orange. The town of West Haven has the greatest population density and also has the largest population. This can be seen by the large bubble near the right edge of Figure 7.3.

There seems to be greater deviation from the model at smaller predicted values and population sizes than in the larger towns. This is evidence that the largest towns exert a large amount of leverage in the fit of the model. The three largest towns (Milford, Hamden, and West Haven) pull the fitted model towards themselves, resulting in smaller residuals for these observations. The smaller towns have less influence on the fit and also appear to be more dissimilar to each other than the larger towns are.

To summarize, the truncated Poisson distribution is useful for estimating population sizes and other settings where the zero frequencies of the Poisson distribution are not observable. Log-linear models for this distribution can easily be fit in GENMOD using the macros provided in Program 7.1. Diagnostics such as residual plots and measures of overdispersion are also useful for examining this type of data.


Chapter 8
The Hypergeometric Distribution

8.1 Introduction 129
8.2 Derivation of the Distribution 131
8.3 Extended Hypergeometric Distribution 136
8.4 Hypergeometric Regression 139
8.5 Comparing Several 2 × 2 Tables 144

8.1 Introduction

This chapter describes a collection of methods that were originally developed for settings in which the sample sizes are too small for the chi-squared approximation of the Pearson chi-squared statistic to be valid. The basic idea behind these methods is to completely enumerate every possible categorical outcome conditional on fixed marginal sums of a 2 × 2 table. In precomputer days, this was a useful idea if there were only a few possible outcomes that could be individually tabulated by hand. With modern computing, these methods are still valid and there is nothing wrong with routinely enumerating many thousands of tables. Examples of such analyses are incorporated into the FREQ and LOGISTIC procedures in SAS. Sophisticated computer algorithms have been developed in recent years that allow for enumeration of large, multidimensional tables. Such enumerations would have been inconceivable at the time that these methods were first proposed. This chapter describes how some of these ideas are implemented in the FREQ procedure. Section 8.4 develops a useful extension of the generalized linear models that are implemented in GENMOD to cover models of the odds ratios in 2 × 2 tables.

A 2 × 2 table of data in Table 8.1 is a good example of a candidate for applying these methods. (This data also appears in Table 2.1 but is repeated here as Table 8.1 for convenience.) Over the past century a tremendous amount of intellectual energy has gone into the statistical analysis of 2 × 2 tables of counts such as those given in Table 8.1. The Pearson chi-squared statistic, for instance, dates back to the year 1900 and is taught in most elementary statistics courses. Because of all this intellectual energy, a large number of summary statistics result when you examine this table using PROC FREQ.

TABLE 8.1 Incidence of tumors in mice exposed to a fungicide. Source: Innes et al. 1969.

                   Exposed   Control   Totals
Mice with tumors       4         5        9
No tumors             12        74       86
Totals                16        79       95

Public domain: Journal of the National Cancer Institute, Oxford University Press.


Program 8.1 produces the summary statistics that are available with PROC FREQ. Some of the output from this program is reproduced in Outputs 8.1 and 8.3. Output 8.3 is given and explained in Section 8.3. These many results are from a data set consisting of only four numbers. There is just one df, so there is really only one free individual observation after conditioning is performed on the various marginal totals. This one df is the one-dimensional hypergeometric random variable that is described in this section.

Program 8.1

data;
   do row=1 to 2;
      do col=1 to 2;
         input count @@;
         output;
      end;
   end;
datalines;
4 5 12 74
run;
proc freq;
   tables row * col / all;
   weight count;
   exact or;
run;

Output 8.1  The FREQ Procedure

Statistics for Table of row by col

Statistic                     DF    Value    Prob
Chi-Square                     1   5.4083   0.0200
Likelihood Ratio Chi-Square    1   4.2674   0.0389
Continuity Adj. Chi-Square     1   3.4503   0.0632
Mantel-Haenszel Chi-Square     1   5.3514   0.0207
Phi Coefficient                    0.2386
Contingency Coefficient            0.2321
Cramer’s V                         0.2386

WARNING: 25% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.


Output 8.1 (continued)

The FREQ Procedure

Statistics for Table of row by col

Fisher’s Exact Test

Cell (1,1) Frequency (F)        4
Left-sided Pr <= F         0.9938
Right-sided Pr >= F        0.0411
Table Probability (P)      0.0349
Two-sided Pr <= P          0.0411

The portion of the output given in Output 8.1 contains a warning concerning the small cell counts encountered in this table. The advice given by Cochran (1954) is that the chi-squared approximation to Pearson’s statistic should be valid as long as all expected cell counts are five or greater. After almost half a century, this is still a good rule to follow.

The lines of Output 8.1 under the heading Fisher’s Exact Test are the subject of Section 8.2. Exact tests, as their name implies, do not depend on asymptotic approximations, such as the chi-squared distribution, in order to draw inferences from the data. Rather, they depend on completely enumerating all possible outcomes.

Specifically, concentrate on the single count of 4 in the upper-left cell of Table 8.1. Conditional on the margin sums of the table, this cell can only contain the counts {0, 1, . . . , 9}. There are then exactly ten tables that are consistent with the marginal counts of this table. Knowing any one single cell value in the ‘interior’ of the 2 × 2 table determines the counts in the other three cells, giving rise to the expression ‘one degree of freedom’.

Implicit in this reasoning is that the count in one cell contains all of the information needed to study the problem. Consequently, the marginal sums of the table are assumed to contain no information concerning the relationship between exposure to the suspected carcinogen and the subsequent development of cancer. The analysis that follows fixes the values of these marginal totals.

The numbers of mice exposed and used as controls were fixed by the scientists in the laboratory. In contrast, the number of mice developing tumors clearly has a random component to its observed value. Despite these very different mechanisms that give rise to the counts in the marginal totals, this method treats them in an equivalent manner. You might be uncomfortable with this way of proceeding. There are other (chiefly Bayesian) analyses of this type of data that take into account the variability of the marginal totals. Useful references for further reading along these lines include Altham (1969) and Plackett (1977).

8.2 Derivation of the Distribution

This section derives the probability distribution of the counts in a 2 × 2 table conditional on the marginal sums. Let X and Y denote independent binomial random variables with respective parameters (n1, p1) and (n2, p2). In the data of Table 8.1, it is useful to think of X and Y as the numbers of exposed mice that develop tumors out of the n1 = 16 mice exposed and the n2 = 79 unexposed mice, respectively. Useful properties of the binomial distribution are given in Section 1.2.

The null hypothesis of independence of rows and columns in a 2 × 2 table is the same as the hypothesis of p1 = p2. Independence of tumor development and fungicide exposure


in this example is the same as saying that the rate of tumor formation is the same whether or not the mice were exposed. The statistical test of significance is performed conditional on the total number of mice with tumors X + Y. This number corresponds to the top row sum of Table 8.1. The sum X + Y is denoted by the symbol m. There were N = n1 + n2 total mice used in the experiment. The remaining notation for the 2 × 2 table is given in Table 8.2.

TABLE 8.2 Notation used for counts in a 2 × 2 table. The distribution of X is given at Equation 8.2.

                                            Totals
         X                Y = m − X         X + Y = m
         n − X            N − n − m + X     N − m
Totals   n = n1           N − n = n2        N = n1 + n2

Section 1.2 explains that the distribution of X + Y is binomial with parameters n1 + n2 and p, where p1 and p2 are equal to this common value. The conditional distribution of X given X + Y = m is then

Pr[X = x | X + Y = m] = Pr[X = x and Y = m − x] / Pr[X + Y = m].

Since the binomial variates X and Y are independent,

Pr[X = x | X + Y = m] = Pr[X = x] Pr[Y = m − x] / Pr[X + Y = m].

If you now substitute the expressions for the various binomial distributions,

\Pr[X = x \mid X + Y = m]
  = \frac{\left\{\binom{n_1}{x} p^x (1-p)^{n_1-x}\right\}
          \left\{\binom{n_2}{m-x} p^{m-x} (1-p)^{n_2-m+x}\right\}}
         {\binom{n_1+n_2}{m} p^m (1-p)^{n_1+n_2-m}}
  \qquad (8.1)

While Equation 8.1 is formidable, there are simple explanations for all of the terms involved. The two {·} terms in the numerator are the probabilities of the pair of independent binomial variates X and Y = m − x. The denominator is also a binomial probability of X + Y = m with parameters n1 + n2 and p.

A remarkable feature of Equation 8.1 is that every appearance of p and (1 − p) cancels in the numerator and the denominator. The exact test of equality for two binomial proportions does not require knowledge of their common value. All of the information about the p parameter is contained in the marginal totals of the 2 × 2 table. The test of the hypothesis based on Equation 8.1 only depends on the single count x, which is consistent with the description of a 2 × 2 table having 1 df.

The distribution resulting from Equation 8.1 can be written as

\Pr[X = x \mid m] = \binom{n}{x}\binom{N-n}{m-x} \bigg/ \binom{N}{m}, \qquad (8.2)

using the simpler notation n = n1 and N = n1 + n2 as the total sample size. Refer to Table 8.2 for clarification.

Equation 8.2 is the probability mass function of the hypergeometric distribution. A table of the values of this distribution with parameters (n, m, N) for the fungicide data is given in Table 8.3.
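Equation 8.2 can be checked numerically. This sketch (plain Python with `math.comb`, not part of the book's SAS programs) reproduces the probabilities tabulated in Table 8.3:

```python
from math import comb

def hypergeom_pmf(x: int, n: int, m: int, N: int) -> float:
    """Pr[X = x | m] = C(n, x) * C(N - n, m - x) / C(N, m), Equation 8.2."""
    return comb(n, x) * comb(N - n, m - x) / comb(N, m)

# Fungicide data: N = 95 mice in total, n = 16 exposed, m = 9 with tumors.
probs = [hypergeom_pmf(x, n=16, m=9, N=95) for x in range(10)]
print(round(probs[4], 4))    # probability of the observed table, .0349
print(round(sum(probs), 6))  # the ten probabilities sum to 1
```

The ten values agree with the Pr[X] column of Table 8.3.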

It is possible to rearrange the various factorials in the binomial coefficients of Equation 8.2 to obtain the equivalent expression

\Pr[X = x] = \binom{m}{x}\binom{N-m}{n-x} \bigg/ \binom{N}{n}. \qquad (8.3)


TABLE 8.3 Enumeration of all possible outcomes for the fungicide data using the probability distributions Equation 8.2 or 8.3. The observed table is indicated. Values of chi-squared are calculated for all possible outcomes. The probabilities Pr[X] for each value of X are plotted as λ = 1 in Figure 8.1.

                        Cumulative
    X        Pr[X]      probability    Pearson χ2
    0        .1752        .1752            .90
    1        .3553        .5304            .00
    2        .2960        .8265            .00
    3        .1325        .9589            .85
obs: 4       .0349        .9938           3.45
    5        .0056        .9994           7.80
    6      5.4 × 10−4     .9999          13.91
    7      3.0 × 10−5     .999961        21.77
    8      8.7 × 10−7     .999999        31.38
    9      9.7 × 10−9    1               42.75

Compare Equations 8.2 and 8.3 to see that this distribution makes no distinction between rows and columns. The basic experiment summarized in Table 8.1 views the risk of tumor formation or not (rows) as a function of exposure to the fungicide (columns). The number of mice exposed is controlled by the experimenter, but the number of mice subsequently developing cancer is a response to this controlled condition. The exact methods described in this chapter, however, are conditional on the number of mice with tumors and make no distinction between a cause and effect relationship between the rows and columns. As with many other measures of association in 2 × 2 tables such as the chi-squared statistic, exact methods do not separately view the rows and columns in a causal relationship, but rather point to a statistical event that is unlikely to be attributed to chance alone. It is up to the statistician to make the appropriate interpretation of the statistical inference. In the case of the fungicide data of Table 8.1, for example, statistics alone cannot be used to prove that fungicide exposure causes cancer.

The range of the hypergeometric random variable X with probability distribution in Equation 8.2 or 8.3 is defined in such a manner that all four cells of Table 8.2 are non-negative. Specifically,

max(0, m + n − N ) ≤ X ≤ min(m, n). (8.4)

For the data of Table 8.1, this range corresponds to the set of values {0, 1, . . . , 9}. Again, notice how this 2 × 2 table is defined by a single count. The probability distribution for the 2 × 2 table is represented by a one-dimensional random variable.
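Equation 8.4 translates directly into code; a small helper (illustrative name, not a SAS macro) enumerates the support:

```python
def hypergeom_support(n: int, m: int, N: int) -> list:
    """All x for which every cell of Table 8.2 is non-negative (Equation 8.4)."""
    return list(range(max(0, m + n - N), min(m, n) + 1))

print(hypergeom_support(n=16, m=9, N=95))   # [0, 1, ..., 9] for the fungicide data
# With larger margins the lower limit becomes positive:
print(hypergeom_support(n=16, m=90, N=95))  # [11, 12, ..., 16]
```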

The expected value of X in the hypergeometric distribution given at Equation 8.2 is

E X = mn/N

just as the expected count would be estimated for use in the Pearson chi-squared statistic. The variance of the hypergeometric distribution

Var X = mn(N − n)(N − m) / {N^2 (N − 1)} = E(X) (N − m)(N − n) / {N(N − 1)}

is smaller than the mean. The easiest way to tabulate this distribution is through the use of the FREQ procedure.

Useful output given by the FREQ procedure appears in Output 8.1. Macro programs to examine additional features of this distribution are given in Table 8.4 later in this chapter.
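For the fungicide margins these moment formulas give concrete numbers, confirming that the variance falls below the mean (a plain Python sketch, not the book's SAS code):

```python
m, n, N = 9, 16, 95  # fungicide data margins

mean = m * n / N
var = m * n * (N - n) * (N - m) / (N ** 2 * (N - 1))

print(round(mean, 4))  # about 1.5158 expected exposed mice with tumors
print(round(var, 4))   # about 1.1532, smaller than the mean
print(var < mean)      # True
```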

A common misconception is that the probability obtained at Equation 8.2 or 8.3 represents the significance level of the observed table. This value is called the Table


Probability (P) in the PROC FREQ output of Output 8.1. In the case of the fungicide data in Table 8.1, the probability of the observed table is

Pr[X = 4] = .0349.

This number is not the statistical significance level or the p-value. Instead you must enumerate the probability of this table and those of all others in the tail of the observed table. As explained in Section 8.1, this enumeration of all points in the tail of the observed table is often referred to as Fisher’s exact test or, more generally, as the exact test.

The use of the word ‘exact’ does not bestow a higher virtue on these methods and certainly does not mean to imply inferior qualities of other competing methods. Instead the word ‘exact’ refers to the enumeration of all possible sample events. As mentioned in Section 8.1, this process is in contrast to asymptotic methods, such as those of the usual chi-squared test, that rely on approximating the test statistic by its large-sample distribution.

Exact tests are generally viewed as a reasonable approach to significance testing, and several examples are given below describing their implementation. Unfortunately, there is no complete agreement as to a precise definition of the tail of the distribution. The hypergeometric random variable with the distribution given at Equation 8.2 is a one-dimensional, discrete-valued random variable. So you can easily enumerate all possible values it can attain. These are given for the fungicide data in Table 8.3 along with the corresponding probability of each possible discrete outcome for the given marginal sums of the 2 × 2 table.

The tail area for this table of data could be defined as the probability of any value of X equal to or above the observed count. Similarly, the tail could also be any possible value of X equal to or smaller than the observed value. For this example, the FREQ procedure gives these values as

Pr[X ≥ 4] = .0411

and

Pr[X ≤ 4] = .9938

in Output 8.1. These two values sum to 1.0349, or one plus the probability of the observed value of X = 4, which is counted twice in this sum.
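These tail sums follow directly from Equation 8.2; the sketch below (pure Python, with a hypothetical helper name) reproduces the PROC FREQ values:

```python
from math import comb

def hypergeom_pmf(x, n, m, N):
    """Equation 8.2 for the 2 x 2 table with fixed margins."""
    return comb(n, x) * comb(N - n, m - x) / comb(N, m)

n, m, N = 16, 9, 95  # fungicide data
upper = sum(hypergeom_pmf(x, n, m, N) for x in range(4, 10))  # Pr[X >= 4]
lower = sum(hypergeom_pmf(x, n, m, N) for x in range(0, 5))   # Pr[X <= 4]

print(round(upper, 4))          # .0411, the right-sided exact p-value
print(round(lower, 4))          # .9938, the left-sided value
print(round(upper + lower, 4))  # 1.0349 = 1 + Pr[X = 4]
```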

Another possible definition of the tail is any table whose probability is smaller than that of the observed table. The FREQ procedure prints this as the two-sided significance level. The test is two-sided because smaller probabilities than that of the observed count can occur above or below the observed count. This p-value is .0411, and it is also the upper tail. This value is given on the last line of Output 8.1.

Yet another possible definition of the tail might be in terms of a test statistic on the original table. You can define the tail of the distribution as any 2 × 2 table with as large or larger value of either the chi-squared or deviance statistics. Such an example can be performed in the FREQ procedure using the EXACT statement:

proc freq;
   tables row * col / all;
   exact chisq;
   weight count;
run;

This program considers all possible values of several useful statistics. All of the values for the chi-squared statistic are listed in Table 8.3, and this chapter uses this statistic to illustrate how to calculate its exact tail area. The relevant output from this program appears in Output 8.2.


Output 8.2 The FREQ Procedure

Statistics for Table of row by col

Pearson Chi-Square Test

Chi-Square 5.4083

DF 1

Asymptotic Pr > ChiSq 0.0200

Exact Pr >= ChiSq 0.0411

The observed value of χ2 = 3.45 is compared with all possible values this statistic could attain. The p-value that PROC FREQ calculates is the probability that such a value or larger of this statistic could be achieved. This probability is not the same as the p-value obtained when you compare the observed chi-squared statistic to its asymptotic, theoretical distribution. The exact tail area for the statistic in this example is the same as the p-value for the Fisher exact test. The FREQ procedure also produces exact tail area p-values for the deviance (G2) and other popular test statistics.

These different definitions of the tail area might not coincide. A table in the tail according to one criterion, such as a smaller probability than that of the observed table, is not necessarily in the tail by having a larger value of chi-squared, for example. A detailed example of the inconsistency of these definitions is given in Zelterman (1999, p. 36).

In general, exact tests for tables with small counts have come under some criticism because they tend to give levels of significance that are larger than those that might be provided by competing methods. That is, exact tests tend to be conservative by not rejecting the null hypothesis as often as they should, and, consequently, they appear to suffer from a loss of power.

One approach to improve on the conservative nature of exact tests is to develop a continuity correction that better reflects the true significance level in the tail area. For the example in this section, the details of calculating the upper tail area show

Pr[X ≥ 4] = Pr[X = 4] + Pr[X = 5] + Pr[X = 6] + · · ·
          = .0349 + .0056 + 5.4 × 10−4 + · · · = .0411

so that the largest contribution to this series is that of the observed table itself.

A continuity corrected exact tail area puts less weight on the probability of the observed table. Such a tail area might then be defined as

corrected tail = 1/2 Pr[X = 4] + Pr[X > 4]
               = (.0349)/2 + .0056 + 5.4 × 10−4 + · · · = .0236.

This significance level of .0236 is much closer to the p-value of .0200 for the chi-squared test obtained by the FREQ procedure in Output 8.1.
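The continuity corrected tail is a one-line change from the ordinary tail sum; a sketch under the same fungicide margins (illustrative Python, not a SAS feature):

```python
from math import comb

def hypergeom_pmf(x, n, m, N):
    """Equation 8.2."""
    return comb(n, x) * comb(N - n, m - x) / comb(N, m)

n, m, N = 16, 9, 95
exact_tail = sum(hypergeom_pmf(x, n, m, N) for x in range(4, 10))
# Count the observed table with weight 1/2 instead of 1:
corrected = 0.5 * hypergeom_pmf(4, n, m, N) + sum(
    hypergeom_pmf(x, n, m, N) for x in range(5, 10))

print(round(exact_tail, 4))  # .0411
print(round(corrected, 4))   # .0236, closer to the asymptotic p-value .0200
```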

However, the continuity corrected exact test is just one more possible definition of the tail area. Fortunately, all of these different definitions do not lead to drastically different conclusions for the fungicide example. This might not always be the case. In Zelterman, Chan, and Mielke (1995) we show that different significance tests can lead to drastically different p-values on the same categorical data. Every definition and statistical method has its good and bad points. The assumptions and approximations of some methods are more applicable in certain settings than others. It is important for you to recognize these strengths and weaknesses when you use any statistical method.


There are a wide variety of definitions of the tail area and significance levels for a multitude of useful statistics. The methods developed in this section are concerned with the null hypothesis of independence of rows and columns in a 2 × 2 table. The following section derives a related distribution that is used to describe the behavior of counts in a 2 × 2 table when independence does not hold under the alternative hypothesis.

There are other generalizations of exact methods to tables larger than 2 × 2. The FREQ procedure has the option of enumerating all possible discrete outcomes in tables larger than 2 × 2 for fixed sets of marginal totals. These methods are not explained here. If you are interested, refer to the SAS documentation for the FREQ and the LOGISTIC procedures for details and useful technical information.

8.3 Extended Hypergeometric Distribution

The hypergeometric distribution illustrated in the previous section is the probability model for the null hypothesis of independence of rows and columns in a 2 × 2 table. Similarly, that distribution is developed as the conditional distribution of X given X + Y = m, where X and Y are independent binomial counts with the same probability parameter. The development of the extended hypergeometric distribution in this section follows the derivation of Equation 8.2 except that the assumption of p1 = p2 is not used. Specifically, the methods of this section describe the alternative hypothesis of p1 ≠ p2. This extension of the basic hypergeometric distribution is useful as a model for the alternative hypothesis in which rows and columns are dependent. This dependence is expressed in terms of the odds ratio for the 2 × 2 table.

Let X and Y denote independent binomial (n, p1) and (N − n, p2) random variables, respectively. This section continues to use the notation of Table 8.2, except in this case the distribution of X + Y does not behave as a binomial. As a result, the denominator of this distribution, given at Equation 8.6 below, does not have a simple expression.

Begin as you did at Equation 8.1 with

Pr[X = x | X + Y = m] = Pr[X = x and Y = m − x] / Pr[X + Y = m].

Since X and Y are independent you can write

Pr[X = x | X + Y = m] = Pr[X = x] Pr[Y = m − x] / Pr[X + Y = m].

The denominator can be expressed as

Pr[X + Y = m] = Σ_j Pr[X = j and Y = m − j].

The independence of X and Y enables you to write

Pr[X + Y = m] = Σ_j Pr[X = j] Pr[Y = m − j].

Now substitute expressions for the various binomial distributions, giving

\Pr[X = x \mid X + Y = m]
  = \frac{\left\{\binom{n}{x} p_1^{x} (1-p_1)^{n-x}\right\}
          \left\{\binom{N-n}{m-x} p_2^{m-x} (1-p_2)^{N-n-m+x}\right\}}
         {\sum_j \left\{\binom{n}{j} p_1^{j} (1-p_1)^{n-j}\right\}
                 \left\{\binom{N-n}{m-j} p_2^{m-j} (1-p_2)^{N-n-m+j}\right\}}.
  \qquad (8.5)

The numerator of Equation 8.5 is the product of the probabilities of the two independent binomial distributions. The denominator is the sum over all possible values j of the


numerator. The range of X = j is given at Equation 8.4. These values are used so that all four entries in Table 8.2 are non-negative.

Equation 8.5 can be greatly simplified by a change in the parameters. Define the odds ratio by

λ = {p1(1 − p2)} / {(1 − p1)p2}.

The probability distribution of the extended hypergeometric distribution can then be written as

\Pr[X = x] = \binom{n}{x}\binom{N-n}{m-x}\lambda^{x}
  \bigg/ \sum_j \binom{n}{j}\binom{N-n}{m-j}\lambda^{j}. \qquad (8.6)

The discrete distribution given here is a proper probability mass function and sums to 1 over x. The denominator does not have a simple expression. Similarly, there are no simple expressions for the mean or variance of this distribution for general values of the parameter λ. This distribution can easily be computed, and SAS macros that fit it are provided in Table 8.4 of Section 8.4. When λ = 1, the binomial probabilities p1 and p2 are equal and the distributions in Equations 8.2 and 8.6 coincide.
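Equation 8.6 is easy to compute by normalizing the λ-weighted terms over the support of Equation 8.4. A Python sketch (the book's own tools are the SAS macros in Table 8.4; this code is only illustrative):

```python
from math import comb

def ext_hypergeom_pmf(x: int, lam: float, n: int, m: int, N: int) -> float:
    """Extended hypergeometric probability, Equation 8.6."""
    lo, hi = max(0, m + n - N), min(m, n)
    weights = {j: comb(n, j) * comb(N - n, m - j) * lam ** j
               for j in range(lo, hi + 1)}
    return weights[x] / sum(weights.values())

# With lam = 1 the distribution reduces to the ordinary hypergeometric:
print(round(ext_hypergeom_pmf(4, 1.0, n=16, m=9, N=95), 4))  # .0349
# A large odds ratio pushes the mass toward the upper end of the range:
print(ext_hypergeom_pmf(9, 125.0, n=16, m=9, N=95))
```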

The distribution given at Equation 8.6 is plotted in Figure 8.1 for different values of λ and marginal sum parameters N = 95, n = 16, and m = 9, corresponding to the fungicide data example of Table 8.1. The distribution is defined on the integers 0, . . . , 9 for this set of marginal sums. The mass points are connected for clarity, but keep in mind that this distribution is discrete valued. The value of λ = 1 corresponds to the usual hypergeometric distribution described at Equation 8.2 in Section 8.2. Values of λ less than one have most of their mass at the low end of the range. A value of λ = .25 produces a distribution with

Figure 8.1  Extended hypergeometric mass functions in the fungicide data for various values of the odds ratio λ. Mass points of this discrete distribution are connected with lines. The (null hypothesis) distribution with λ = 1 is given in Section 8.2 at Equation 8.2. The maximum likelihood estimated value of λ is 4.815.

[Figure 8.1: Pr[X] (y-axis, 0 to .6) plotted against X (x-axis, 0 to 9) for λ = .25, 1, 4.815, 12, and 125.]


most of its mass at X = 0. Similarly, large values of the odds ratio (such as λ = 125) produce distributions with almost all of their probability mass located at the upper range of the distribution at X = 9.

There are several methods for estimating λ in Equation 8.6, and three of the most popular are given here. The simplest of these is the empirical odds (cross product) ratio of the 2 × 2 table. This statistic is also discussed in Section 2.2.2.

For the fungicide data, the empirical odds ratio is

(4 × 74)/(12 × 5) = 4.9333.

This value is calculated by PROC FREQ and given in Output 8.3. Program 8.1 produces this output.

Maximum likelihood is another popular procedure for estimating parameters. Maximum likelihood for log-linear models is also described in Section 2.6. The maximum likelihood estimator λ̂ of λ is the value of the parameter that maximizes the probability of the observed data. In the fungicide example of Table 8.1, λ̂ is the value of λ that maximizes the probability

Pr[X = 4 | λ, N = 95, m = 9, n = 16]

in Equation 8.6. This value is λ̂ = 4.815, which is very close to the value of the empirical odds ratio given above.
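The conditional MLE can also be found without any special macro by maximizing Equation 8.6 in λ directly. A simple ternary-search sketch (illustrative Python, not the book's %HYPINV macro; the likelihood is unimodal in log λ, so the search is safe):

```python
from math import comb, exp, log

def ext_hypergeom_pmf(x, lam, n, m, N):
    """Extended hypergeometric probability, Equation 8.6."""
    lo, hi = max(0, m + n - N), min(m, n)
    denom = sum(comb(n, j) * comb(N - n, m - j) * lam ** j
                for j in range(lo, hi + 1))
    return comb(n, x) * comb(N - n, m - x) * lam ** x / denom

def conditional_mle(x, n, m, N):
    """Ternary search on the log scale for the lam maximizing Pr[X = x | lam]."""
    lo, hi = log(1e-3), log(1e3)
    for _ in range(200):
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if ext_hypergeom_pmf(x, exp(a), n, m, N) < ext_hypergeom_pmf(x, exp(b), n, m, N):
            lo = a
        else:
            hi = b
    return exp((lo + hi) / 2)

print(round(conditional_mle(4, 16, 9, 95), 3))  # about 4.815, matching %hypinv
```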

An easy way to calculate the maximum likelihood estimate uses the macros developed in this chapter and given in Program 8.4. The following SAS program calls upon the %HYPINV macro to accomplish this:

data mle;
   %hypinv(9,16,95,4);  /* row, column, total, count of table */
   lamex=exp(lamd);     /* %hypinv returns log of mle as lamd */
run;
proc print;
   var lamex;
run;

The %HYPINV macro takes four arguments: the row sum m, column sum n, table total N, and the single count X. The variable LAMD is created and contains the logarithm of the maximum likelihood estimate λ̂ of λ.

The Mantel-Haenszel estimator of λ is also a popular method for estimating a single common odds ratio across a set of 2 × 2 tables. This estimator is also available in the FREQ procedure. In a single 2 × 2 table, the Mantel-Haenszel estimator is the same as the empirical odds ratio. The methods of Section 8.4 describe a way to model a set of odds ratios that vary across 2 × 2 tables.

Exact and asymptotic confidence intervals for λ are also calculated with the FREQ procedure. The upper (λU) and lower (λL) 95% confidence interval endpoints for the fungicide data are those parameter values that solve the equations

Pr[X < 4 | λL] = .975

and

Pr[X > 4 | λU] = .975

where Pr[X | λ] is given at Equation 8.6. These two equations would be extremely tedious to solve by hand, but their solutions are easily obtained using Program 8.1. The EXACT statement with the OR option provides the exact 95% confidence intervals for the λ parameter. The relevant portion of the output from this program is given in Output 8.3.


Output 8.3  The FREQ Procedure

Statistics for Table of row by col

Odds Ratio (Case-Control Study)

Odds Ratio 4.9333

Asymptotic Conf Limits

95% Lower Conf Limit 1.1579

95% Upper Conf Limit 21.0182

Exact Conf Limits

95% Lower Conf Limit 0.8341

95% Upper Conf Limit 26.1606

This output also includes asymptotic confidence intervals for this parameter. The exact 95% confidence interval calculated by PROC FREQ in Output 8.3 contains the value λ = 1 but the asymptotic interval does not. The value of λ = 1 corresponds to the null hypothesis of independence of fungicide exposure and tumor outcome. This means that the relationship between fungicide exposure and tumor development just barely attains statistical significance at the .05 level.
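The two defining equations can also be solved numerically by bisection, since each tail probability is monotone in λ. The sketch below (illustrative Python, not the FREQ algorithm) uses the conventional exact-tail definitions, with strict inequalities Pr[X < 4 | λL] = .975 and Pr[X > 4 | λU] = .975, which reproduce the exact limits in Output 8.3:

```python
from math import comb

def ext_hypergeom_cdf(x, lam, n, m, N):
    """Pr[X <= x | lam] under Equation 8.6."""
    lo, hi = max(0, m + n - N), min(m, n)
    w = [comb(n, j) * comb(N - n, m - j) * lam ** j for j in range(lo, hi + 1)]
    total = sum(w)
    return sum(w[j - lo] for j in range(lo, x + 1)) / total

def solve(f, target, lo=1e-4, hi=1e4, iters=200):
    """Bisection for f(lam) = target, assuming f is monotone on [lo, hi]."""
    decreasing = f(lo) > f(hi)
    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # bisect on the log scale
        if (f(mid) > target) == decreasing:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

n, m, N = 16, 9, 95
# Pr[X < 4 | lamL] = Pr[X <= 3 | lamL] = .975 and Pr[X > 4 | lamU] = .975:
lam_lower = solve(lambda lam: ext_hypergeom_cdf(3, lam, n, m, N), 0.975)
lam_upper = solve(lambda lam: 1 - ext_hypergeom_cdf(4, lam, n, m, N), 0.975)

print(round(lam_lower, 3), round(lam_upper, 2))  # near 0.834 and 26.16
```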

The following section develops methods for modeling log(λ) as a linear function of covariate values. In this way the odds ratio λ can vary in a predictable manner across several 2 × 2 tables.

8.4 Hypergeometric Regression

In this section the log of the odds ratio parameter λ in the extended hypergeometric distribution (Equation 8.6) is modeled as a linear function of covariates. Just as Poisson regression models the parameter of the Poisson distribution, these methods model the parameter of the hypergeometric distribution and so are called hypergeometric regression. Typical examples of the application of these methods appear in the analysis of case-control studies. In such studies, a sequence of 2 × 2 tables is constructed and the odds ratio parameter λ is the quantity of interest. The example presented in this section is an examination of such a case-control study. A second example, presented in Section 8.5, shows how the odds ratio parameter of the hypergeometric distribution can be modeled as a function of multiple covariate values.

An example of a case-control study linking myocardial infarction (heart attack) with oral contraceptive use is given by Shapiro et al. (1979) and is summarized in Table 8.5. Women who had an infarction were matched with other healthy controls of similar ages and asked whether they had ever taken oral contraceptives, a suspected risk factor in heart disease. The marginal table demonstrating this relationship is given in Table 8.4. The strong relationship between infarction and oral contraceptive use in this table (χ2 = 5.84; 1 df; p = .02) has an odds ratio of 1.68, which provides strong evidence that contraceptive use is associated with an increased risk of heart attack.


140 Advanced Log-Linear Models Using SAS

TABLE 8.4 The marginal association between myocardial infarction case-control status and oral contraceptive use (χ2 = 5.84; 1 df; p = .02). The full data appears in Table 8.5. Source: Shapiro et al. 1979.

              Oral contraceptive use?
              Yes       No     Totals

Case           29      205        234
Control       135     1607       1742

Totals        164     1812       1976

Reprinted with permission from Elsevier Science (The Lancet, April 7, 1979, pp. 743–6).
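The summary statistics quoted for Table 8.4 are easy to check. A small Python sketch (illustrative only; the book's computations are done in SAS) recovers both the odds ratio and the Pearson chi-squared from the four cell counts:

```python
# Cell counts from Table 8.4: cases then controls, by oral contraceptive use.
a, b = 29, 205            # cases: yes, no
c, d = 135, 1607          # controls: yes, no
n = a + b + c + d         # 1976 women in all

odds_ratio = (a * d) / (b * c)
# Pearson chi-squared for a 2x2 table via the shortcut formula
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(odds_ratio, 2), round(chi2, 2))  # 1.68 5.84
```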

TABLE 8.5 A case-control study testing the relationship between myocardial infarction and oral contraceptive use. The marginal frequencies are given in Table 8.4. Source: Shapiro et al. 1979.

Age group              Oral contraceptive use?          Log odds
(year range)              Yes      No    Totals            ratio

1 (25–29)   Cases           4       2         6
            Control        62     224       286             1.98
            Totals         66     226       292

2 (30–34)   Cases           9      12        21
            Control        33     390       423             2.18
            Totals         42     402       444

3 (35–39)   Cases           4      33        37
            Control        26     330       356             0.43
            Totals         30     363       393

4 (40–44)   Cases           6      65        71
            Control         9     362       371             1.31
            Totals         15     427       442

5 (45–49)   Cases           6      93        99
            Control         5     301       306             1.36
            Totals         11     394       405

Reprinted with permission from Elsevier Science (The Lancet, April 7, 1979, pp. 743–6).

The next question asked by the medical researchers was whether or not the strength of this relationship (in terms of the odds ratio) is related to the age of the participant. The data, stratified by age, is presented in Table 8.5. At each age group the data forms a 2 × 2 table. The logarithm of the empirical odds ratio is given for each age. A similar example with a related analysis appears in Section 3.2.
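The empirical log odds ratios in the last column of Table 8.5 follow directly from the cell counts. An illustrative Python check (the book's own analysis is done in SAS) reproduces all five values:

```python
import math

# (cases yes, cases no, controls yes, controls no) per age group, Table 8.5
strata = {
    1: (4, 2, 62, 224),
    2: (9, 12, 33, 390),
    3: (4, 33, 26, 330),
    4: (6, 65, 9, 362),
    5: (6, 93, 5, 301),
}
log_or = {g: round(math.log((a * d) / (b * c)), 2)
          for g, (a, b, c, d) in strata.items()}
print(log_or)  # {1: 1.98, 2: 2.18, 3: 0.43, 4: 1.31, 5: 1.36}
```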

There are several ways to proceed with an analysis of the data in Table 8.5. The relationship between the binary-valued contraceptive use and age is best analyzed using a logistic regression. Similarly, the change in risk of heart disease with age should also be modeled using logistic regression.

A model of the change in the relationship between case-control status and contraceptive use with age requires a different type of analysis. Specifically, the approach taken in this section asks whether the log odds ratios in Table 8.5 are related to the women's ages. The way this is done is to use the GENMOD procedure to regress the log odds ratio on the age


group number. The GENMOD program to perform this regression is given in Program 8.2. The models fitted by this program express log(λ) in Equation 8.6 as a linear function of covariates whose regression coefficients are estimated by GENMOD.

The following program fits the extended hypergeometric distribution to the myocardial infarction case-control data of Table 8.5. Some of the output from this program is given in Output 8.4.

Program 8.2

title1 'Extended hypergeometric distribution in GENMOD';
title2 'Oral contraceptive use and myocardial infarction, Shapiro 1979';

data mi;
   input age cwoc ocuse cases total;
   label
      cwoc  = 'cases with oc use'
      ocuse = '# in age stratum using oc''s'
      cases = '# of cases in age stratum'
      total = 'sample size in this age stratum'
      age   = 'age category'
      loga  = 'log-age category';
   loga = log(age);
   datalines;
1  4  66   6  292
2  9  42  21  444
3  4  30  37  393
4  6  15  71  442
5  6  11  99  405
run;

title3 'Model for log-odds ratio is linear in age';
proc genmod data=mi;
   %fithyp(cases,ocuse,total);        /* Provide table margins */
   model cwoc = age / obstats intercept=2
         initial = -.30 to -.10 by .05 maxit=250000;
run;

title3 'Model for log-odds ratio is linear in log-age';
proc genmod data=mi;
   %fithyp(cases,ocuse,total);        /* Provide table margins */
   model cwoc = loga / obstats intercept=2
         initial = -.80 to -.60 by .05 maxit=250000;
   ods output obstats=fitted;         /* Create fitted value data set */
   ods exclude listing obstats;
run;

data both;
   merge mi fitted;                   /* Merge fitted & original values */
   %hypmean(cases,ocuse,total,xbeta);
   mean = hypmean;
   sd = sqrt(hypvar);
run;

proc print data=both;
   var age loga mean sd total pred xbeta;
run;

A series of macros are given in Program 8.4 that greatly facilitate fitting the extended hypergeometric distribution in GENMOD. These macros are needed to compute the variance, link, inverse link, and deviance of the distribution. Calculation of the inverse link is straightforward and requires evaluation of the expected value of the distribution. The deviance is more difficult and requires the value of the log odds ratio that maximizes the


extended hypergeometric mass function (Equation 8.6). This parameter is obtained as the result of an iterative procedure. The macros are somewhat lengthy but can be used without too much difficulty, as shown next.

The basic form of the GENMOD program to fit the hypergeometric distribution using these macros is

proc genmod;
   %fithyp(row,col,total);
   model count = cov1 cov2;

run;

where the invocation of the %FITHYP macro provides the row, column, and table totals for each 2 × 2 table. The MODEL statement models the mean of the response COUNT by expressing log(λ) as a linear function of the covariates COV1 and COV2. The regression coefficients of this linear function are estimated by GENMOD.

The program that fits the hypergeometric distribution tends to run slowly, and GENMOD sometimes needs a little assistance in obtaining estimated parameter values. Program 8.2, for example, provides initial estimates for the values of the fitted parameters. In this program, the MODEL statement

model cwoc = age / obstats intercept=2
      initial = -.30 to -.10 by .05 maxit=250000;

provides GENMOD with a starting value for the intercept and a set of initial starting values for the estimated regression coefficient on age. Such values are often obtained by a little trial-and-error and can greatly facilitate convergence of the algorithm. The MODEL statement here includes the MAXIT= option, which allows GENMOD an additional number of iterations to achieve convergence of the parameter values. As specified, Program 8.2 takes only a few seconds to run.

A portion of the output from Program 8.2 is given in Output 8.4. The deviance and chi-squared measures indicate a good fit to the data when you fit a model that is linear in the log-age. That is,

log λage = α + β log(age group).

Output 8.4

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 3 4.9930 1.6643

Scaled Deviance 3 4.9930 1.6643

Pearson Chi-Square 3 5.6358 1.8786

Scaled Pearson X2 3 5.6358 1.8786

Log Likelihood -2.4965


Output 8.4 (continued)

The GENMOD Procedure

Analysis Of Parameter Estimates

                           Standard        Wald 95%
Parameter   DF  Estimate      Error   Confidence Limits   Chi-Square   Pr > ChiSq

Intercept    1    1.9907     0.2627    1.4758    2.5056        57.42       <.0001
loga         1   -0.6768     0.2487   -1.1643   -0.1894         7.41       0.0065
Scale        0    1.0000     0.0000    1.0000    1.0000

NOTE: The scale parameter was held fixed.

The five 2 × 2 tables, one for each age group, each provide one degree of freedom. The GENMOD procedure estimates two parameters (an intercept and a regression slope on age), leaving 3 df for the deviance and chi-squared statistics. The fit is very good (χ2 = 5.64; 3 df; p = .13) for this model of the data.

The estimated regression slope on the log-age category is negative with a high degree of statistical significance (Wald χ2 = 7.41; 1 df; p = .007). The negative estimate for this slope indicates that the relationship between contraceptive use and infarction becomes weaker with age. A similar example regressing the log-odds in a series of 2 × 2 tables is presented in Section 3.2.

Program 8.2 also fits the model in which the log-odds ratio is linear in the age group. That is,

log λage = α′ + β′ (age group).

This model has a slightly worse fit than the model that is linear in the log-age (χ2 = 6.29; 3 df; p = .098). The regression coefficient (β′) in this model does not attain statistical significance in testing whether it is different from zero (Wald χ2 = 3.45; 1 df; p = .063).

The fitted means and log odds ratios for this example are given in Table 8.6. The Observed count column refers to the count of contraceptive-using cases in each stratum of Table 8.5, that is, the number of cases in each age group who also use oral contraceptives. The Fitted means are the means of the extended hypergeometric distribution with the fitted log odds ratio and marginal sums from the corresponding age stratum.

TABLE 8.6 Fitted values for the myocardial infarction case-control study. The log-odds in each stratum is linear in the log-age category. The observed count refers to the number of contraceptive-using cases in each age stratum. The empirical log-odds ratios are also given in Table 8.5 (χ2 = 5.64; 3 df; p = .13).

Age group    Observed   Expected        Log odds ratio
range           count      count     Fitted   Empirical

25–29               4       4.03       1.99        1.98
30–34               9       6.26       1.52        2.18
35–39               4       7.16       1.24        0.43
40–44               6       5.16       1.05        1.31
45–49               6       4.81       0.90        1.36
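The Fitted column of Table 8.6 is just the regression line evaluated at each age group. The Python sketch below (illustrative; GENMOD does this internally) applies the rounded estimates from Output 8.4. Because the coefficients are rounded to four decimals, group 3 comes out as 1.25 rather than the 1.24 that GENMOD obtains at full precision:

```python
import math

alpha, beta = 1.9907, -0.6768     # intercept and slope on log-age, Output 8.4
fitted = [round(alpha + beta * math.log(group), 2) for group in range(1, 6)]
print(fitted)  # [1.99, 1.52, 1.25, 1.05, 0.9]
```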


8.5 Comparing Several 2 × 2 Tables

This section demonstrates how hypergeometric regression can be used to compare the odds ratios of different 2 × 2 tables. A second example to illustrate hypergeometric regression is given in Table 8.7. This table summarizes a study in which mice of two genetic strains and both sexes were exposed to a fungicide and subsequently examined for lung tumors. The first of these four 2 × 2 tables was examined as Table 8.1 earlier in this chapter. The log odds ratio in each of the four tables is given in Table 8.7. The question addressed with the statistical modeling of this data is whether there are large strain and/or sex differences in these log odds ratios.

TABLE 8.7 Lung cancer in four groups of mice exposed to a fungicide. Source: Innes et al. 1969.

                  Exposure          Tumors                     Log
Strain   Sex      status        Yes     No    Totals    odds ratio

X        M        Exposed         4     12        16
                  Control         5     74        79         1.596
                  Totals          9     86        95

X        F        Exposed         2     12        14
                  Control         3     84        87         1.540
                  Totals          5     96       101

Y        M        Exposed         4     14        18
                  Control        10     80        90         0.827
                  Totals         14     94       108

Y        F        Exposed         1     14        15
                  Control         3     79        82         0.632
                  Totals          4     93        97

Public domain: Journal of the National Cancer Institute, Oxford University Press.
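As with Table 8.5, the log odds ratios in Table 8.7 follow directly from the cell counts. An illustrative Python check (the modeling itself is done with SAS) reproduces all four values:

```python
import math

# (exposed yes, exposed no, control yes, control no) per strain/sex group
tables = {
    ("X", "M"): (4, 12, 5, 74),
    ("X", "F"): (2, 12, 3, 84),
    ("Y", "M"): (4, 14, 10, 80),
    ("Y", "F"): (1, 14, 3, 79),
}
log_or = {k: round(math.log((a * d) / (b * c)), 3)
          for k, (a, b, c, d) in tables.items()}
print(log_or)
# {('X', 'M'): 1.596, ('X', 'F'): 1.54, ('Y', 'M'): 0.827, ('Y', 'F'): 0.632}
```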

Program 8.3 examines these four 2 × 2 tables. The program uses strain and sex as CLASS variables. The MODEL statement contains the option MAXITER=250, which specifies that up to 250 iterations might be needed to attain convergence of the parameter values. This program calls upon the %FITHYP macro given in Program 8.4 at the end of this chapter.

The following program fits the extended hypergeometric distribution to the four 2 × 2 fungicide exposure tables.


Program 8.3

title1 'Extended hypergeometric regression in GENMOD';
title2 'Four fungicide 2x2 tables';

data avadex;
   input expt exposed tumors total strain $ sex $;
   label
      expt    = 'exposed mice w/tumors'
      exposed = 'mice exposed to fungicide'
      tumors  = 'mice with tumors'
      total   = 'sample size in this 2x2 table'
      strain  = 'X or Y'
      sex     = 'M or F';
   datalines;
4 16  9  95 X M
2 14  5 101 X F
4 18 14 108 Y M
1 15  4  97 Y F
run;

proc genmod;
   %fithyp(exposed,tumors,total);   /* Provide table margins */
   class strain sex;
   model expt = strain sex / obstats maxiter=250 type1 type3;

run;

The output of this program is given in Outputs 8.5 and 8.6. The fitted model has an almost perfect fit to the data: both the deviance and chi-squared statistics are much smaller than one. There are four tables and each table contributes one df. These goodness-of-fit statistics each have 1 df because three parameters are estimated: an intercept and one regression coefficient for each of the strain and sex effects. The extremely good fit of the model indicates very weak statistical significance of an interaction effect between sex and strain.

Output 8.5

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 1 0.0060 0.0060

Scaled Deviance 1 0.0060 0.0060

Pearson Chi-Square 1 0.0061 0.0061

Scaled Pearson X2 1 0.0061 0.0061

Log Likelihood -0.0030


Output 8.5 (continued)

The GENMOD Procedure

Analysis Of Parameter Estimates

                            Standard        Wald 95%
Parameter    DF  Estimate      Error   Confidence Limits   Chi-Square   Pr > ChiSq

Intercept     1    0.7994     0.3625    0.0889    1.5099         4.86       0.0274
strain  X     1    0.7898     0.4529   -0.0979    1.6776         3.04       0.0812
strain  Y     0    0.0000     0.0000    0.0000    0.0000            .            .
sex     F     1   -0.1088     0.5159   -1.1200    0.9024         0.04       0.8330
sex     M     0    0.0000     0.0000    0.0000    0.0000            .            .
Scale         0    1.0000     0.0000    1.0000    1.0000

NOTE: The scale parameter was held fixed.

Output 8.6

The GENMOD Procedure

LR Statistics For Type 1 Analysis

Source Deviance DF Chi-Square Pr > ChiSq

Intercept 0.9501

strain 0.0229 1 0.93 0.3356

sex 0.0060 1 0.02 0.8964

The GENMOD Procedure

LR Statistics For Type 3 Analysis

Source DF Chi-Square Pr > ChiSq

strain 1 0.95 0.3288

sex 1 0.02 0.8964

The mouse strain has a modest effect on the log odds ratio. The Wald statistic has a p-value of less than .1, but neither the Type 1 nor the Type 3 test indicates statistical significance. The X strain mice have log odds ratios about .8 larger than those of the Y mice. Interpret this to mean that the X strain is more susceptible to the effects of the fungicide. There is not a statistically significant difference between the male and female mice. The female log odds ratios are about .1 lower than those of the male mice.

The significance levels for sex and strain differences are very similar for the Type 1 and Type 3 tests. These levels are given in Output 8.6. The significance level for strain given by the Wald test in Output 8.5 is .08, in comparison to the value of .33 provided by the Type 1 and Type 3 tests. When there are large discrepancies in the p-values such as you see here, the preferred method is to cite the Type 3 significance level. This advice is also given in Section 2.5.2.

In conclusion, the hypergeometric regression outlined in this section is not a generalized linear model in the strictest sense. In generalized linear models, as outlined in the SAS


GENMOD documentation (SAS 1999, pp. 1364–1464), the forward and inverse links are functions of _XBETA_ and _MEAN_ alone. This is not exactly true in the present application. The mean of the hypergeometric distribution given at Equation 8.6 is a function of the marginal row and column sums of the 2 × 2 table as well as the log odds ratio _XBETA_. Nevertheless, the SAS macros given in Program 8.4 provide a reasonably close

approximation to the precise implementation of the methods described in this section. The following program creates macros to fit the extended hypergeometric distribution

(Equation 8.6) in GENMOD.

Program 8.4

/* Macros to fit the extended hypergeometric distribution in GENMOD */

%macro hyp(m,n,nn,x);
   /* Log of the two binomial coefficients in the numerator of the
      extended hypergeometric distribution: m=row sum; n=col sum;
      nn=sample size; x=count */
   lgamma(&m+1)-lgamma(&x+1)-lgamma(&m-&x+1)          /* m choose x */
   /* nn-m choose n-x */
   + lgamma(&nn-&m+1)-lgamma(&n-&x+1)-lgamma(&nn-&n-&m+&x+1)
   /* note omitted semicolon */
%mend hyp;

%macro hypmean(m,n,nn,lor);
   /* Extended hypergeometric mean and variance: m=row sum;
      n=column sum; nn=sample size; lor=log-odds ratio */
   den=0;
   hypmean=0;
   hypvar=0;
   do j=max(0,&m+&n-&nn) to min(&n,&m);    /* loop over range */
      dterm=exp(%hyp(&m,&n,&nn,j)+j*&lor);
      den=den+dterm;                       /* accumulate denominator */
      hypmean=hypmean+j*dterm;             /* accumulate mean */
      hypvar=hypvar+j**2*dterm;            /* accumulate variance */
   end;
   hypmean=hypmean/den;                    /* divide by denominator */
   hypvar=hypvar/den-hypmean**2;           /* corrected variance */
%mend hypmean;

%macro hypinv(m,n,nn,expv);
   /* Find the hypergeometric parameter giving a distribution with
      expected value equal to expv. This value is called lamd upon
      completion. Interval bisection is used to find lamd.
      The initial interval for estimation of the log odds ratio is
      +/- lorlimit. This log odds parameter is needed for the
      deviance. */
   lorlimit=15;
   lamlo=-lorlimit;
   lamhi= lorlimit;                        /* initial interval */

   /* if expv is at an extreme of its range then the model is degenerate */
   if &expv LE max(0,&m+&n-&nn) then lamhi=lamlo;   /* expv at low end */
   if &expv GE min(&n,&m) then lamlo=lamhi;         /* expv at high end */


   do until (abs(lamhi-lamlo)<1e-6);       /* convergence criterion */
      lamd=(lamhi+lamlo)/2;                /* examine midpoint */
      %hypmean(&m,&n,&nn,lamd);            /* mean at midpoint */
      /* shrink interval: equate expected value with expv */
      if hypmean GE &expv then lamhi=lamd;
      if hypmean LE &expv then lamlo=lamd;
   end;
%mend hypinv;

%macro fithyp(n,m,nn);
   /* This macro provides GENMOD with the link, inverse link,
      variance, and deviance needed to perform regression on the
      log odds ratio parameter of the extended hypergeometric
      distribution. The parameters are: n=row sum; m=column sum;
      nn=2x2 table total. */

   /* Inverse link is the mean of the distribution at _XBETA_ */
   %hypmean(&m,&n,&nn,_XBETA_);
   mean0=hypmean;

   /* Likelihood evaluated at _XBETA_ is the denominator of the deviance */
   devden=exp(%hyp(&m,&n,&nn,_RESP_)+_RESP_*_XBETA_)/den;

   /* The deviance requires the unconstrained maximum likelihood
      estimator. This estimate is the parameter value that equates
      the expected value with the observed value, _RESP_ */
   %hypinv(&m,&n,&nn,_RESP_);
   var0=hypvar;                            /* Variance at MLE */

   /* Maximized likelihood is the numerator of the deviance */
   devnum=exp(%hyp(&m,&n,&nn,_RESP_)+_RESP_*lamd)/den;

   /* Deviance at the observed count is twice the log ratio */
   devi=2*log(devnum/devden);

   /* Tell GENMOD about these values */
   invlink  ilink = mean0;                 /* Hypergeometric mean at _XBETA_ */
   fwdlink  link  = log(_MEAN_);           /* Forward link */
   deviance dev   = devi;                  /* Deviance */
   variance vari  = var0;                  /* Hypergeometric variance */
%mend fithyp;

The nature of this approximation means that the MAXIT= option is often needed to increase the number of iterations allowed to attain convergence of the parameter values. The fitting process for some data sets is also aided by providing initial parameter estimates to GENMOD using the INTERCEPT= and INITIAL= options in the MODEL statement. An example of the use of these options is included in Program 8.2.

Numerical difficulties might occur when data sets with very large counts are studied. The derivation and original use of the hypergeometric distribution were geared towards the analysis of data sets with small counts. The Poisson regression example in Section 3.2 is appropriate when the 2 × 2 tables have large counts.


Chapter 9
Sample Size Estimation and Power for Log-Linear Models

9.1 Introduction 149
9.2 Background Theory 149
9.3 Power for a 2 × 2 Table 154
9.4 Sample Size for an Interaction 161
9.5 Power for a Known Sample Size 167

9.1 Introduction

The topic of power and sample size estimation is presented here in the context of the analysis of categorical data using log-linear models. One of the questions most frequently asked by those planning a research project concerns the appropriate sample size. The identification of the primary outcome, whose values determine the efficacy of the experiment, is often overlooked but is essential before power can be discussed. Issues such as the design of the experiment and financial constraints far outweigh the problems of power and sample size, so these issues should play a more important role in the discussion. For example, an experiment in which the estimated sample size far exceeds the available budget should be reconsidered. The estimation of power is often only a small part of the overall plan, but the mystique of the statistical machinery often causes it to rise to the top of the fears expressed by those initiating the research.

This chapter begins with a review of the definitions of the basic topics in the language of hypothesis testing, such as significance level and power. Under the alternative hypothesis, test statistics such as the deviance and chi-squared behave as non-central chi-squared random variables. Section 9.2 describes this distribution mathematically, in figures, and in tables. Formulas are given for the calculation of the non-centrality parameter, and these allow for the approximation of the power for a given set of hypotheses. Section 9.3 provides a simple example that illustrates these principles for the analysis of a 2 × 2 table of counts. A SAS program is given that provides an estimate of the needed sample size corresponding to a range of values of the power. Section 9.4 examines a larger example based on real data and a realistic set of plans for testing an interaction effect in a log-linear model. Section 9.5 shows how to estimate a range of alternative hypotheses that can reasonably be tested for a specified sample size.

9.2 Background Theory

This section provides a brief description of how the power of an experiment is approximated using the Pearson and deviance chi-squared statistics. If these statistics behave as chi-squared under the null hypothesis, then they will generally behave as non-central chi-squared under the alternative hypothesis with the same number of df. This section briefly


reviews some of the mathematics of these two distributions and also reviews some of thelanguage used in hypothesis testing.

If z1, . . . , zk behave as independent, standard normal (zero mean, unit variance) random variables, then z1² + · · · + zk² behaves as a chi-squared with k df. This distribution describes the general behavior of the Pearson chi-squared and deviance statistics under the null hypothesis. These statistics will usually have fewer df than the number of observations because parameters will typically have to be estimated. Every parameter that is estimated creates a linear dependence among the zi's, thereby reducing the df of the reference chi-squared distribution.

Under the alternative hypothesis, suppose each zi is normally distributed with mean µi and unit variance. In this case the statistic z1² + · · · + zk² behaves as a non-central chi-squared with k df and a non-centrality parameter

λ = µ1² + · · · + µk². (9.1)

When all of the µi's are zero, then λ in Equation 9.1 is also zero and the usual chi-squared and the non-central distributions coincide. Some authors refer to the usual chi-squared (λ = 0) distribution as the central chi-squared distribution to avoid confusion with the non-central (λ > 0) chi-squared. It is through the non-centrality parameter λ that the power of the Pearson chi-squared statistic is determined. The examples presented in the following two sections demonstrate how λ can be estimated in order to approximate the sample size needed.

A non-central chi-squared random variable will tend to be larger in value than a central chi-squared variate with the same number of df. The greater the non-centrality parameter λ, the larger the difference between these distributions. This chapter demonstrates that larger values of λ coincide with larger values of the power.
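A quick simulation makes this concrete. A non-central chi-squared with k df and non-centrality λ has mean k + λ, so shifting the means of the underlying normals away from zero inflates the statistic. The Python sketch below (purely illustrative; the µ values are invented for the demonstration) sums squared normal draws with non-zero means, as in Equation 9.1:

```python
import random

random.seed(1)
mus = [1.0, 2.0, 1.5, 0.5]          # means under the alternative; k = 4
lam = sum(mu * mu for mu in mus)    # non-centrality (Equation 9.1): 7.5
draws = 200_000
total = 0.0
for _ in range(draws):
    total += sum(random.gauss(mu, 1.0) ** 2 for mu in mus)
print(round(total / draws, 1))      # close to k + lam = 11.5
```

Setting every µ to zero instead recovers the central case, with simulated mean near k = 4.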

If you want the null hypothesis to be correctly rejected with high probability, then the non-centrality parameter needs to be sufficiently large under the alternative hypothesis. The examples given in the following two sections show how the non-centrality parameter can be estimated in practical settings.

These points are illustrated in Figure 9.1 using a 4 df chi-squared distribution as an example. A central (λ = 0) 4 df chi-squared random variable has an upper 5% critical value of 9.488. The upper 5% tail area of this distribution is indicated by the shaded region. Larger values of λ result in more of the non-central chi-squared area above the critical value of 9.488. Table 9.1 below shows that more than 90% of the area of a non-central chi-squared with λ = 16 is to the right of 9.488. Loosely speaking, if λ = 16 under the alternative hypothesis, then the power will be approximately 90%.
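The tail areas behind these statements can be computed from the standard representation of the non-central chi-squared as a Poisson(λ/2) mixture of central chi-squared distributions. The Python sketch below (illustrative; SAS provides equivalent probability functions) exploits the closed-form survival function available when the df is even, which is always the case for the mixture components when the base df is 4:

```python
import math

def chisq_sf_even(x, df):
    """Pr[chi-squared(df) > x] for even df, via the closed-form series."""
    t = x / 2.0
    return math.exp(-t) * sum(t ** i / math.factorial(i) for i in range(df // 2))

def ncx2_sf(x, df, lam, terms=100):
    """Pr[non-central chi-squared(df, lam) > x] as a Poisson mixture."""
    h = lam / 2.0
    weight = math.exp(-h)            # Poisson(h) probability at j = 0
    total = 0.0
    for j in range(terms):
        total += weight * chisq_sf_even(x, df + 2 * j)
        weight *= h / (j + 1)        # step to the next Poisson weight
    return total

print(round(ncx2_sf(9.488, 4, 0.0), 3))  # 0.05: the central 5% tail
print(ncx2_sf(9.488, 4, 16.0))           # power for lam = 16, just above 0.90
```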

To quickly review, the significance level is the probability of incorrectly rejecting the null hypothesis (λ = 0 in the present situation) when the null is true. The significance level is often denoted by the symbol α. You usually pick α to be small; the value of .05 is often chosen. The null hypothesis is rejected when the observed significance level (usually called the p-value) is smaller than α. That is, the decision is to reject the null hypothesis if the observed outcome has (at most) probability α of occurring under the null hypothesis. You can feel confident when you reject the null hypothesis because an unusually small p-value is very unlikely to have occurred by chance alone.

The power of a study is the probability that the null hypothesis will be correctly rejected when the alternative hypothesis holds. The power of the test will depend on the choice of the alternative hypothesis. In the case of chi-squared tests, alternative hypotheses with larger values of the non-centrality parameter λ are more likely to be detected and will have greater power, all other things being equal. The two sections that follow give examples in which a reasonable alternative hypothesis is estimated and used to approximate the sample size needed for an experiment.


Figure 9.1 Non-central chi-squared density functions with 4 df and values of the non-centrality parameter λ as given (λ = 0, 4, 8, 12, 16). The upper 5% critical value of the central distribution, 9.488, is marked, and the upper 5% tail area of the central density is shaded. [Figure not reproduced.]

.........................................

.........................................................................................................................................................................................................................................................................................................................

.......................................

..................................................................

...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

The further the µ_i's are from zero, the larger the value of the non-centrality parameter λ in Equation 9.1. The degrees of freedom remain the same under the null and alternative hypotheses. It doesn't matter whether the µ_i's in Equation 9.1 are positive or negative. The non-centrality parameter is only a function of the squared µ_i's, so their signs are ignored when approximating power. This repeats the argument given in Section 2.2 that the chi-squared statistic yields a two-sided test.

The null hypothesis has λ = 0 and the alternative is λ > 0. Larger values of λ under the alternative hypothesis result in chi-squared tests with greater power. The exact values of λ needed for a specified power are given in Table 9.1. The use of this table is illustrated in an example in the following section.

In order to estimate λ and the power of a proposed study you first need an estimate of the cell probabilities under the null (p0) and under the alternative (pA) hypotheses. This requirement often creates a circular argument: You don't know these parameters until you actually run the experiment, but you need their values before the experiment is performed. Nevertheless, you can often use rough estimates, usually on the basis of a small pilot study or from a published article on a closely related topic.


152 Advanced Log-Linear Models Using SAS

TABLE 9.1 Example of non-centrality parameters needed to attain specified power at significance level α = .05.

        Power at significance level .05
df      .65      .75      .85      .90
 1     5.50     6.94     8.98    10.51
 2     6.21     8.58    10.93    12.66
 3     7.93     9.77    12.30    14.17
 4     8.76    10.72    13.42    15.40

The relationship between the µ_i's and the p_i's is

µ_i² = N (p0_i − pA_i)² / p0_i .

If p0_i and pA_i are moved further apart, then µ_i² will be larger and the non-centrality parameter λ = Σ µ_i² will be larger as well. This relationship is more sensitive to small values of p0_i, which appears in the denominator here. Consequently, examinations of sparse tables with many small expected counts tend to have difficulties that are not addressed here. A discussion of the special methods and challenges of the analysis of sparse tables of categorical data is given in Zelterman (1999, pp. 189–95).

The non-centrality parameter of the Pearson chi-squared statistic is

λ_P = N Σ_i (p0_i − pA_i)² / p0_i     (9.2)

and is proportional to the sample size, N. Notice the similarity in the functional form of Equation 9.2 to that of the chi-squared statistic. You can also write λ_P in terms of expected cell means:

λ_P = Σ_i (N p0_i − N pA_i)² / N p0_i     (9.3)

The non-centrality parameter for the deviance statistic

λ_D = 2N Σ_i p0_i log(p0_i / pA_i)     (9.4)

is also proportional to the sample size. As with Equation 9.2, the non-centrality parameter of the deviance is similar in functional form to that of the deviance statistic.

Values of λ_D in Equation 9.4 should be close to those of λ_P given at Equation 9.2 for the chi-squared statistic under the same conditions for which deviance and chi-squared will be similar in value. These conditions require that none of the p0_i's be very small. Additionally, a good pair of hypotheses needs to have the pairs of values (pA_i, p0_i) reasonably close together so that the statistical test of these hypotheses makes sense. For example, it is unreasonable, from a statistical point of view, to test a statistical hypothesis with p0_i = 0 and pA_i = 1 because a single observation will conclusively prove which hypothesis is true.

If you fix the values of p0 and pA and only vary N in Equation 9.2, then you can find the power of the chi-squared or deviance statistics for a given set of null and alternative hypotheses in terms of the sample size. An example in the following section demonstrates the typical process, which starts with the non-centrality parameter from Table 9.1 that gives the appropriate power and then solves for the required sample size N in Equation 9.2.

A small set of values of the non-centrality parameters and their corresponding powers at significance level α = .05 are given in Table 9.1. An illustration corresponding to this


Chapter 9 Sample Size Estimation and Power for Log-Linear Models 153

table is given in Figure 9.1. Figure 9.1 shows that more than 90% of the area of a 4 df non-central chi-squared with λ = 16 is to the right of the upper 5% critical value for the central distribution. Table 9.1 shows that when λ = 15.40 then exactly 90% of the non-central chi-squared area is above the critical value. In other words, a 4 df chi-squared with non-centrality parameter 15.40 will have power 90%.

More extensive tables are unnecessary because other values of power, df, and so on can be calculated in SAS. It is possible to calculate the approximate sample size necessary in a single line:

n=cnonct(cinv(1-sig,df),df,1-power)/ncp;

Here SIG is the desired significance (α) level; NCP is the non-centrality parameter for a sample of N = 1 calculated at Equation 9.2 or 9.3; and DF is the degrees of freedom associated with the problem. A set of estimated sample sizes and powers for a given value of the NCP parameter are calculated in the SAS macro in Program 9.1.
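For readers working outside SAS, the same one-line computation can be sketched with SciPy. This is only an illustration, not the book's code: the helper name `solve_n` is my own, and the bracket passed to `brentq` assumes the required non-centrality lies below 1000.

```python
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

def solve_n(df, ncp, sig=0.05, power=0.90):
    """Sample size N so a chi-squared test with df degrees of freedom,
    per-observation non-centrality ncp, and level sig attains `power`.

    Mirrors n = cnonct(cinv(1-sig,df),df,1-power)/ncp in SAS: first find
    the non-centrality lam with P(X > critical value) = power, then
    divide by the per-observation non-centrality."""
    crit = chi2.ppf(1 - sig, df)          # upper critical value, cinv(1-sig, df)
    lam = brentq(lambda l: ncx2.sf(crit, df, l) - power, 1e-9, 1e3)
    return lam / ncp

# Example from Section 9.3: df = 1, ncp = .0152 per observation
print(round(solve_n(1, 0.0152, power=0.75), 1))   # 456.6, in agreement with Output 9.1
```

The call structure deliberately mirrors the SAS line: CINV corresponds to `chi2.ppf` and CNONCT to the root-finding step over the non-central survival function.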

Program 9.1

/* Approximate power for chi-squared tests.
   INVOKE AS %chipow(df=1,ncp=.,alpha=.);
   INPUT:
      df    = degrees of freedom, taken to be 1 if missing
      ncp   = non-centrality parameter for a sample of size n=1
      alpha = significance level under null hypothesis, taken
              to be .05 if missing
   OUTPUT variables:
      power = values .25 to .9 by .05 and .9 to .99 by .01
      n     = approximate sample size at respective power
*/
%macro chipow(df=1,ncp=.,alpha=.);
   sig=&alpha;
   if sig=. then sig=.05;
   do j=1 to 23;                         /* range for power            */
      if j<15 then power=.2 + j/20;      /* power is .25 by .05 to .9  */
      if j>=15 then power=.9+.01*(j-14); /* ... and .9 by .01 to .99   */
      /* approximate sample size */
      n=cnonct(cinv(1-sig,&df),&df,1-power)/&ncp;
      output;
   end;
   drop j sig;                           /* drop local variables */
%mend chipow;

An easy way to approximate the NCP parameter for these macros is by setting

ncp = χ²/N     (9.5)

or else

ncp = G²/N     (9.6)

where χ² or G² are computed on the N observations from a pilot study. An example is given in the following section.
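As an illustration of Equation 9.5, here is a Python sketch (not part of the book's SAS code) that computes the Pearson chi-squared for a pilot 2 × 2 table and divides by N; the pilot counts are those of Table 9.2 in the next section.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Pilot data from Table 9.2: Y-strain male mice, tumors (yes, no)
table = np.array([[4, 14],     # exposed
                  [10, 80]])   # control

# Pearson chi-squared without continuity correction
chi2_stat, p, dof, expected = chi2_contingency(table, correction=False)
N = table.sum()
ncp = chi2_stat / N            # Equation 9.5: per-observation non-centrality

print(round(chi2_stat, 2), round(ncp, 4))   # 1.64 0.0152
```

These are exactly the values .0152 and χ² = 1.64 used in the worked example that follows.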

A general SAS macro is given in Program 9.1 to calculate estimated sample sizes for various choices of power, significance levels, and degrees of freedom. To invoke this macro the following SAS program is sufficient:


title1 'Approximate power for chi-squared tests';
data initial;   /* specify: df, ncp, and alpha parameters */
   %chipow(df=1,ncp=.0152,alpha=.05);
run;
proc print noobs;   /* print power table */
run;

In summary, this section outlines the usual manner in which sample sizes are estimated. Begin by approximating the cell probabilities for the null and alternative hypotheses. Then use these values for the p_i's in Equations 9.2 and 9.4 to approximate the non-centrality parameter. The appropriate value of the non-centrality parameter λ from Table 9.1 provides the power. It only remains to solve for the sample size N necessary to attain the desired power. Two examples are illustrated in the following sections. The material in Section 9.5 shows you how to estimate the alternative hypotheses that can reasonably be tested when the sample size N is known but the null and alternative hypothesis parameters are not known.

9.3 Power for a 2 × 2 Table

As a numerical example of computing power and estimating sample sizes, consider an examination of a single 2 × 2 table taken from the data given in Table 8.7. The Y-strain male mice in this experiment are summarized in Table 9.2. Mice were either exposed to a fungicide or else given the usual care. After 85 weeks all the mice were sacrificed and their lungs were examined for cancerous growths by a group of trained pathologists. Analyses of these mice and others in the same experiment appear in Sections 2.2 and 8.5.

TABLE 9.2 Observed frequencies for the Y-strain male mice in Table 8.7. This 2 × 2 table summarizes incidence of lung cancer in mice exposed to a fungicide. (χ² = 1.64; G² = 1.45) Source: Innes et al. 1969.

Exposure        Tumors
status        Yes     No    Totals
Exposed         4     14        18
Control        10     80        90
Totals         14     94       108

Public domain: Journal of the National Cancer Institute, Oxford University Press.

The null hypothesis is that there is no difference between the exposure status of these mice. That is, the probability of eventual tumor development is the same whether or not the mice are exposed. The chi-squared and deviance statistics test this null hypothesis against an alternative in which there is an association between exposure and the development of lung cancer.

An examination of this data reveals that for the group of mice in Table 9.2 there is no statistically significant relationship between fungicide exposure and the development of lung cancer. The relevant statistics are

χ² = 1.64; 1 df; p = .20

and

G² = 1.45; 1 df; p = .22.

The empirical odds ratio is 2.29. There is likely to be some positive association between exposure and subsequent tumor development that is not statistically significant following the experiment. One possible difficulty is that this experiment did not have a sufficiently large sample of mice in order to demonstrate the relationship. Of course, there might not be any relationship between fungicide exposure and cancer development for this group of mice, and this would also be useful to know.

The question addressed in this section is the estimation of a sample size from a pilot study in order to demonstrate a relationship. A few assumptions are necessary to state before the mathematics can be used. One of these assumptions is that the observed frequencies of Table 9.2 are estimates of the alternative hypothesis you wish to test. The calculations are fairly easy if you assume that mice will be allocated to exposed and unexposed groups in the same proportions as in the experiment summarized in Table 9.2. That is, there will be 90/18 = 5 control mice for every one exposed to the fungicide. A different allocation scheme is demonstrated below.

Suppose mice are to be allocated to the exposed and unexposed groups in the same 1 : 5 proportion as in the experiment summarized in Table 9.2. The non-centrality parameter λ_P for the Pearson chi-squared statistic can be estimated using Equation 9.5, giving

λ_P = N × ncp = N (1.64/108) = .0152 N

where 1.64 is the observed value of chi-squared in Table 9.2 and 108 is the sample size.

Table 9.1 shows that a 1 df chi-squared will require a non-centrality parameter of λ = 6.94 to achieve a power of 75%. A sample of size

N = 6.94/.0152 = 456.6

is then required for power 75%. Similarly, a power of 90% requires a non-centrality parameter of λ = 10.51, which is obtained with a sample of size

N = 10.51/.0152 = 691.4.

In a more general setting, the SAS program

data initial;
   %chipow(df=1,ncp=.0152,alpha=.05);
run;
proc print noobs;
run;

uses the macro of Program 9.1 and produces Output 9.1 for a wide range of power values.

Output 9.1 shows that a sample of size 108 has low power in the neighborhood of only .25. This is usually considered to be too low for most practical uses. A sample size in the range of 450–600 is reasonable in order to achieve a power of 75% to 85%. The power estimates from Output 9.1 are plotted in Figure 9.2.
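The reverse calculation, power at a fixed sample size, uses the same non-central chi-squared area; a sketch (the helper name `power_at` is my own, and the .0152 non-centrality comes from the pilot data above):

```python
from scipy.stats import chi2, ncx2

def power_at(N, df=1, ncp=0.0152, sig=0.05):
    """Approximate power of the chi-squared test at sample size N:
    the non-central chi-squared area beyond the central critical value."""
    crit = chi2.ppf(1 - sig, df)        # central upper critical value
    return ncx2.sf(crit, df, ncp * N)   # survival function at non-centrality ncp*N

print(round(power_at(108), 2))          # 0.25, the power at the pilot sample size
```

At N = 456.6 the same function returns approximately .75, consistent with Output 9.1.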


Output 9.1

power         n
 0.25    108.40
 0.30    135.40
 0.35    163.01
 0.40    191.54
 0.45    221.31
 0.50    252.70
 0.55    286.16
 0.60    322.27
 0.65    361.86
 0.70    406.05
 0.75    456.60
 0.80    516.37
 0.85    590.68
 0.90    691.28
 0.91    716.76
 0.92    744.96
 0.93    776.61
 0.94    812.72
 0.95    854.92
 0.96    905.85
 0.97    970.49
 0.98   1059.86
 0.99   1208.72


Figure 9.2 Estimated power for chi-squared as a function of the sample size N for two different allocations of control and exposed mice. The pilot data is given in Table 9.2. [Plot: power (.3 to .9) against sample size N (250 to 1250), with one curve for the 2 : 1 allocation and one for the 5 : 1 allocation.]

The analysis using the deviance statistic has an estimated non-centrality parameter

λ_D = N × ncp = N (1.45/108) = .0134 N

where G² = 1.45 and 108 is the sample size.

The power estimates for this statistic and its non-centrality parameter are obtained in a similar manner. The SAS program is

data;
   %chipow(df=1,ncp=.0134,alpha=.05);
run;
proc print noobs;
run;

and produces Output 9.2. Sample sizes in the range of 520–670 are needed to obtain power of 75% to 85% for the deviance statistic. These sample size estimates are comparable to, but just slightly larger than, those estimated from the chi-squared statistic in Output 9.1.


Output 9.2

power         n
 0.25    122.96
 0.30    153.58
 0.35    184.91
 0.40    217.27
 0.45    251.04
 0.50    286.64
 0.55    324.59
 0.60    365.57
 0.65    410.47
 0.70    460.60
 0.75    517.93
 0.80    585.74
 0.85    670.03
 0.90    784.14
 0.91    813.04
 0.92    845.03
 0.93    880.93
 0.94    921.89
 0.95    969.75
 0.96   1027.53
 0.97   1100.85
 0.98   1202.23
 0.99   1371.08

The remainder of this section describes how to estimate the sample size if a different allocation of mice to the control and exposed groups is desired. The process involves providing estimates of the various cell probabilities under both the null and alternative hypotheses and then making use of the formulas in Equations 9.2 and 9.4 to approximate the values of the non-centrality parameters. These non-centrality parameters can then be used in the macro of Program 9.1 as demonstrated above.

To estimate the power for a large study to repeat the experiment of Table 9.2, begin by noting that the estimated rates of tumor development are

4/18 = 22.22%

for the exposed group of mice and

10/90 = 11.11%

for the unexposed, control mice.


The ratio of control to exposed mice in Table 9.2 is 90/18, or exactly 5 : 1. If the new experiment is designed so that this ratio is 2 : 1, then the frequencies anticipated under the alternative hypothesis pA_i are given in Table 9.3.

TABLE 9.3 Probabilities anticipated under the null and alternative hypothesis for testing the relationship between fungicide exposure and development of lung tumors based on the pilot data of Table 9.2. Mice are allocated to control and exposed groups in the ratio of 2 : 1.

Alternative hypothesis: pA_i

Exposure        Tumors
status        Yes      No    Totals
Exposed      .074    .259      .333
Control      .074    .593      .667
Totals       .148    .852     1.000

Null hypothesis: p0_i

Exposure        Tumors
status        Yes      No    Totals
Exposed      .049    .284      .333
Control      .099    .568      .667
Totals       .148    .852     1.000

As a numerical example, the proportion of mice that are both exposed and anticipated to develop tumors is 22.22% of the 33.33% of mice exposed, or

.2222 × .3333 = .0741.

The probabilities p0_i for the null hypothesis are obtained in Table 9.3 by multiplying the various marginal rates.

The non-centrality parameter λ for chi-squared is obtained using Equation 9.2. From the two sets of probabilities in Table 9.3, calculate

λ_P = N {(.074 − .049)²/.049 + · · · + (.593 − .568)²/.568} = .0224 N.

Then refer to Table 9.1 to see that a 1 df chi-squared with power .75 requires λ = 6.94. Consequently, a sample of size

N = 6.94/.0224 = 310.2

is needed. Similarly, an experiment with power .90 requires λ = 10.51 and a sample of size

N = 10.51/.0224 = 469.8.

More generally, the SAS program

data;
   %chipow(df=1,ncp=.0224,alpha=.05);
run;
proc print noobs;
run;

produces Output 9.3. These values are plotted in Figure 9.2.


Output 9.3

power         n
 0.25    73.557
 0.30    91.875
 0.35   110.614
 0.40   129.976
 0.45   150.178
 0.50   171.474
 0.55   194.177
 0.60   218.686
 0.65   245.547
 0.70   275.536
 0.75   309.835
 0.80   350.396
 0.85   400.821
 0.90   469.081
 0.91   486.372
 0.92   505.512
 0.93   526.983
 0.94   551.490
 0.95   580.121
 0.96   614.684
 0.97   658.545
 0.98   719.192
 0.99   820.200

The deviance statistic for this problem has non-centrality parameter

λ_D = .0206 N

requiring samples of size

6.94/.0206 = 336.3

for power .75 and

10.51/.0206 = 509.3

for power .90.

These sets of sample sizes for the deviance are only slightly larger than the corresponding values of 310 and 470 obtained above for the chi-squared test. There should generally be only small differences between the estimated sample sizes for these two methods. Some gain in power is achieved by choosing a 2 : 1 allocation ratio of mice for the planned experiment, rather than choosing a 5 : 1 ratio. A smaller sample size is required when the sampling allocation is more evenly balanced. This allocation is something you can easily control in the experiment, and it is clearly advantageous for you to do so unless there are reasons that are not a part of this power analysis.

In summary, this section shows how an alternative hypothesis can be estimated, resulting in an approximated non-centrality parameter λ_P or λ_D for 2 × 2 tables. Equations 9.2 and 9.4 give the relationships between the probabilities under the null and alternative hypotheses, the sample size N, and the non-centrality parameters, λ_P for the Pearson chi-squared statistic and λ_D for the deviance statistic. Values of λ in Table 9.1 can then be used to convert these non-centrality parameters into estimates of the power and sample sizes.

The macro in Program 9.1 produces estimated sample sizes needed for a specified significance level and values of the power ranging from .25 to .99. This program provides a wide range of values for the power and can be modified for other values of the power if desired. A useful presentation aid is to plot the power attained as a function of the sample size, as you saw in Figure 9.2.

9.4 Sample Size for an Interaction

The data in Table 9.4 is given by Bishop (1969) and Bishop et al. (1975, pp. 41–2). This data summarizes a survey in two clinics concerning the relationship between prenatal care and infant mortality. The level of care was classified as the binary values of More or Less, and the outcome was whether or not the infant survived to a specified age. This section uses this data set hypothetically to plan a similar study at two clinics using the same outcome variables. The implicit assumption is that if you repeat the survey, then you will obtain almost the same proportions as reported. In other words, the differences in clinics found in this survey are representative of the differences you will find if you repeat the study.

TABLE 9.4 Infant mortality at two clinics classified by amount of prenatal care. Source: Bishop (1969).

          Prenatal
Clinic    care       Died   Survived
A         Less          3        176
          More          4        293
B         Less         17        197
          More          2         23

Used with permission: International Biometric Society.

At a meeting with the researchers, suppose the following two questions came up:

Question 1. How large a sample is needed to show a relationship between prenatal care and mortality?

Question 2. How large a sample is needed to show a difference between clinics?

You will need to translate these two questions into the framework of a hypothesis test based on the chi-squared statistic and non-central chi-squared alternative hypotheses. The first question is fairly easy to resolve, and its solution roughly follows the example given in Section 9.3. The second question is somewhat ambiguous and might be open to several different interpretations.

To answer the first question, begin by collapsing the clinic categories to form the 2 × 2 marginal table given in Table 9.5. This table cross-classifies infant mortality by amount of prenatal care and is used to estimate the frequencies for the alternative hypothesis. The four frequencies in this table can be used to approximate the alternative hypothesis that prenatal care is indeed related to infant mortality.


TABLE 9.5 Marginal table of data in Table 9.4 classifying infant mortality by amount of prenatal care (χ² = 5.26; 1 df; p = .022).

Prenatal
care      Died   Survived   Totals
Less        20        373      393
More         6        316      322
Totals      26        689      715

The chi-squared for the marginal data is 5.26, with 1 df. Chi-squared estimates the non-centrality parameter for a sample of size 715 in Table 9.5. The non-centrality parameter for a sample of size N is then

λ_P = N × 5.26/715 = .0073566 N.
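The marginal chi-squared, and the power it implies at the pilot sample size of 715, can be verified outside SAS; a sketch in Python (SciPy assumed):

```python
import numpy as np
from scipy.stats import chi2, ncx2, chi2_contingency

# Marginal Table 9.5: prenatal care (less, more) by outcome (died, survived)
table = np.array([[20, 373],
                  [6, 316]])
chi2_stat, p, dof, expected = chi2_contingency(table, correction=False)
N = table.sum()                             # 715

# Power at the pilot sample size: non-centrality is chi-squared itself (N x 5.26/715)
crit = chi2.ppf(0.95, dof)                  # central 1 df critical value
power_715 = ncx2.sf(crit, dof, chi2_stat)

print(round(chi2_stat, 2), round(power_715, 2))
```

The power at N = 715 comes out near the "about .65" quoted below for the original sample.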

Use this value in the SAS program

title1 'Approximate power for Question 1 of Section 9.4';
data;
   %chipow(df=1,ncp=.0073566,alpha=.05);
run;
proc print noobs;
run;

This program produces Output 9.4.


Output 9.4 Approximate power for Question 1 of Section 9.4

power         n
 0.25    223.97
 0.30    279.75
 0.35    336.81
 0.40    395.76
 0.45    457.27
 0.50    522.12
 0.55    591.25
 0.60    665.87
 0.65    747.66
 0.70    838.98
 0.75    943.41
 0.80   1066.91
 0.85   1220.45
 0.90   1428.30
 0.91   1480.95
 0.92   1539.22
 0.93   1604.60
 0.94   1679.22
 0.95   1766.40
 0.96   1871.64
 0.97   2005.20
 0.98   2189.86
 0.99   2497.41

The original data in Table 9.4 has a sample of size 715. The power for finding a difference between different rates of infant death by level of prenatal care should be about .65 with a sample of this size. Output 9.4 gives the approximate power for other sample sizes to test Question 1. A power of .65 might be too low for most studies, and a sample on the order of 1000–1200 for a power of about .80 is more realistic.

The second question raised in planning this survey is the sample size required to detect a difference between clinics. This question is somewhat ambiguous, as you will see next, and can lead to vastly different results based on how it is interpreted.

One interpretation is taken in light of the analysis of Question 1. In the first question you want to demonstrate that there is a statistically significant cross-product ratio between prenatal care and mortality. The search for a clinic difference might mean that you are interested in showing that the strength of this relationship between mortality and prenatal care is different between the two clinics. This is a test of a three-way interaction.

Page 176: Discrete Distributions - SAS

164 Advanced Log-Linear Models Using SAS

Look back at Table 9.4 to see that the observed cross-product ratio for clinic A is

(3 × 293)/(4 × 176) = 1.249

and the cross-product ratio of clinic B is

(17 × 23)/(2 × 197) = 0.992.
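These two cross-product (odds) ratios can be checked directly; a short sketch with the counts of Table 9.4 (the helper name is my own):

```python
# Cell counts from Table 9.4: rows (less, more prenatal care), columns (died, survived)
clinic_a = [[3, 176], [4, 293]]
clinic_b = [[17, 197], [2, 23]]

def cross_product_ratio(t):
    """Odds ratio ad/bc of a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = t
    return (a * d) / (b * c)

print(round(cross_product_ratio(clinic_a), 3))   # 1.249
print(round(cross_product_ratio(clinic_b), 3))   # 0.992
```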

A difference or ratio of two cross-product ratios is a three-way interaction in a log-linear model. This interpretation of the three-way interaction is described, modeled, and tested in examples of Sections 3.3 and 8.4. The null hypothesis of no difference between the two odds ratios corresponds to the log-linear model with no three-factor interaction, namely

log m_cpm = µ + α_c + β_p + ε_m + η_cp + φ_cm + ξ_pm .     (9.7)

The indices (c, p, m) correspond to the two levels of clinic, prenatal care, and mortality, respectively.

You can fit Equation 9.7 in GENMOD using the code

proc genmod;
   class clinic care surv;
   model count = clinic | care | surv @2 / dist = Poisson;
run;

The fitted means are given in Table 9.6. These fitted values yield the same estimated cross-product ratio of 1.11 in both clinics. That is, Equation 9.7 asserts that there is no clinic difference in terms of the relationship between infant mortality and prenatal care.

TABLE 9.6 Fitted means for the infant mortality data of Table 9.4 for Equation 9.7 with no three-factor interaction (χ² = .044; 1 df).

          Prenatal
Clinic    care       Died   Survived
A         Less       2.81     176.19
          More       4.19     292.81
B         Less      17.19     196.81
          More       1.81      23.19

The fit of Equation 9.7 is very good (χ² = .044). This is an indication that the two hypotheses are very close together and a huge sample size will be necessary to distinguish between them. The approximate non-centrality parameter for a sample of size N is

λ_P = N × .044/715 = 6.15 × 10⁻⁵ N.

Chi-squared has 1 df, reflecting the one-parameter difference between the log-linear model in Equation 9.7 for the null hypothesis and the unconstrained alternative hypothesis.

A huge sample size is needed to detect a difference of this small magnitude. For example, a study of power .75 would require a sample of size

N = 6.94/(6.15 × 10⁻⁵) ≈ 113,000.

This figure is much higher than the approximation of 1000 to 1200 anticipated by the answer to Question 1, described above.

A distinction should be made between statistical significance and practical significance. Intuitively, a huge sample size is needed here because the magnitude of the effect comparing clinics is very small. Proving a statistically significant result, however, might not correspond to any meaningful comparison in terms of these differences. The statistical and practical (often called clinical) significance are measured on different scales and are interpreted differently. The statistician might show a difference using analytic methods, but the practical significance asks whether these differences are large enough for the reader to be concerned.

The problem in Question 2 has to be interpreted differently. Such reformulation of the basic research question is a common way to proceed. The scientific thinking about the problem has to be adjusted in light of the data. You can still discuss the differences between the two clinics, as outlined in Question 2, but the specific differences have to be interpreted differently. You could also ask whether the outcome variable or primary outcome is appropriate, or whether your measures are meaningful. There are many different ways to proceed at this point.

One possible direction you can take is to suppose the aim of showing clinic differences can be expressed as showing a difference in the mortality rates of the different clinics. A survival-by-clinic interaction is a pair-wise interaction and can be summarized in a marginal 2 × 2 table, just as the interaction of mortality by level of care is summarized in Table 9.5. The marginal clinic-by-survival 2 × 2 table appears as Table 9.7.

TABLE 9.7 Marginal table of data in Table 9.4 classifying infant mortality by clinic (χ² = 19.06; 1 df).

Clinic     Died    Survived    Totals
A             7         469       476
B            19         220       239
Totals       26         689       715

You could proceed with the value of χ² = 19.06 from Table 9.7, but some caution should be used here. In light of Question 1, the important interaction between prenatal care and infant mortality has been summed over in Table 9.7. This risks the problem of Simpson's paradox. The dangers of Simpson's paradox are summarized in Section 2.4. Simpson's paradox is most likely to occur in marginal tables, such as Table 9.7, that sum over other important interactions.

A better way to proceed that avoids the risks of Simpson's paradox is to fit a log-linear model that includes the interaction between mortality and prenatal care but omits the mortality-by-clinic interaction:

log m_cpm = µ + α_c + β_p + ε_m + η_cp + ξ_pm .    (9.8)

This log-linear model has a poor fit (χ² = 13.51; 2 df; p = .0012). The poor fit shows that a test of the difference in mortality rates at the two clinics will have reasonable power, so a huge sample size will not be necessary. You could also use the chi-squared statistic from the marginal data in Table 9.7 here, but the partial association obtained from Equation 9.8 is a safer way to proceed.

The non-centrality parameter for a sample of size N is

λ_P = N × 13.51/715 = .0189N

for a non-central chi-squared with 2 df. The SAS program

title1 'Approximate power for Question 2 of Section 9.4';
data;
   %chipow(df=2, ncp=.0189, alpha=.05);
run;
proc print noobs;
run;

produces Output 9.5.


Output 9.5 Approximate power for Question 2 of Section 9.4

   power         n

    0.25     119.34
    0.30     146.85
    0.35     174.54
    0.40     202.77
    0.45     231.88
    0.50     262.26
    0.55     294.35
    0.60     328.72
    0.65     366.08
    0.70     407.50
    0.75     454.54
    0.80     509.77
    0.85     577.95
    0.90     669.52
    0.91     692.61
    0.92     718.11
    0.93     746.67
    0.94     779.19
    0.95     817.10
    0.96     862.75
    0.97     920.52
    0.98    1000.11
    0.99    1132.06

In Output 9.5 you can see that a sample of size 500–600 is needed to attain power of approximately 85%. This estimated sample size is about half of the estimate needed to answer Question 1, earlier in this section.
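The %CHIPOW macro is not shown here, but the quantity it computes is simply a non-central chi-squared tail probability. For 2 df, both the critical value and each central tail probability have closed forms, so the calculation can be cross-checked outside SAS. The sketch below is illustrative only (the function name `power_2df` is not part of the book's code); it uses the standard Poisson-mixture representation of the non-central chi-squared distribution:

```python
import math

def power_2df(lam, alpha=0.05):
    """Power of a 2-df chi-squared test with non-centrality lam.
    A non-central chi-squared is a Poisson(lam/2) mixture of central
    chi-squared variables with 2 + 2k df; for even df each central
    tail probability has a closed form."""
    crit = -2.0 * math.log(alpha)     # chi-squared(2 df) critical value
    half_l, half_x = lam / 2.0, crit / 2.0
    w = math.exp(-half_l)             # Poisson(lam/2) weight, k = 0
    term = math.exp(-half_x)          # e^(-x/2) (x/2)^k / k!, k = 0
    surv = 0.0                        # running P(chi2 with 2+2k df > crit)
    power = 0.0
    for k in range(150):
        surv += term                  # tail probability at 2 + 2k df
        power += w * surv             # weight times central tail
        term *= half_x / (k + 1)      # next term of the tail sum
        w *= half_l / (k + 1)         # next Poisson mixture weight
    return power

# With lam = .0189 N, N = 119.34 gives power of about .25, as in Output 9.5
print(round(power_2df(0.0189 * 119.34), 3))
```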

Of course, the cost of enrolling subjects might far outweigh the advantage in power. This message is repeated from the introduction of this chapter: power and sample size approximations are only a part of the design of a good experiment, and other issues might far outweigh the calculation of power. Planners and researchers need to know about power, but only as a part of the larger picture. Hopefully, this produces a discussion that could lead to a reformulation of the original research questions, resulting in another round of power calculations. Successive iterations in this fashion are useful in achieving a clearer vision of the overall objectives.


9.5 Power for a Known Sample Size

The examples of the previous two sections use published reports to determine sample sizes for prospective studies or, perhaps, allow you to estimate the power of a study already completed. Retrospective power estimates are often useful when a study has failed to achieve a statistically significant finding and the authors want to show the likelihood of the outcome they had already obtained.

In prospective design, it is often the case that researchers are unable to point to a relevant pilot study. Sometimes all you know is the size of the budget and that this constraint determines the sample size. In this section you begin with the sample size and ask what kinds of alternative hypotheses can be tested. Following the discussion of the previous section, these analyses should be part of the planning process. You can examine various alternative hypotheses, a priori, and then ask whether these are worth pursuing.

If you assume that all marginal totals are 1/2, then the null hypothesis can be expressed with cell probabilities as given in the top half of Table 9.8. The bottom half of Table 9.8 provides one possible parameterization for an alternative hypothesis. The marginal sums are retained in both the null and alternative hypotheses. The null hypothesis corresponds to θ = 0. This alternative hypothesis only makes sense when θ is not too far from zero, so that all four cell probabilities are positive.

TABLE 9.8 Null and alternative hypothesis cell probabilities for a 2 × 2 table when nothing is known a priori about the outcome and all marginal totals are equal. Table 9.9 is more general and enables you to vary the marginal probabilities.

                               Totals
  Null          .25      .25      .5
  hypothesis    .25      .25      .5
  Totals        .5       .5      1.0

                                   Totals
  Alternative   .25 + θ   .25 − θ    .5
  hypothesis    .25 − θ   .25 + θ    .5
  Totals        .5        .5        1.0

The non-centrality parameter for testing the hypotheses in Table 9.8 is

λ = 16Nθ².    (9.9)

The non-centrality parameter λ is determined by the significance level and the power desired. Similarly, the sample size N has also been specified by budgetary constraints. It only remains for you to solve for the parameter

θ = ±(λ/16N)^(1/2).

The odds ratio of this alternative hypothesis is then

ψ = (1/4 + θ)²/(1/4 − θ)²,

or equivalently 1/ψ when the sign of θ is reversed.

Program 9.2 calculates both θ and the odds ratio ψ > 1 for different levels of the significance level and power. The input to this program is the sample size N and the marginal probabilities of the table. Remember, you assumed that all marginal probabilities are equal to 1/2 in this example. Three levels of significance (.01, .05, and .10) are given in this example, but these can easily be changed. The levels of power are the same as those given in Program 9.1.
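The two closed-form steps, solving Equation 9.9 for θ and then forming ψ, are simple enough to sketch outside SAS. The value λ = 7.849 used below is the standard 1-df non-centrality for α = .05 and power .80 (it is what SAS's CNONCT function returns, to within rounding); this is an illustration, not the book's program:

```python
import math

def theta_psi(lam, n):
    """Invert Equation 9.9, lam = 16*N*theta**2, for theta, then form
    the odds ratio psi = ((1/4 + theta)/(1/4 - theta))**2."""
    theta = math.sqrt(lam / (16.0 * n))
    psi = ((0.25 + theta) / (0.25 - theta)) ** 2
    return theta, psi

theta, psi = theta_psi(7.849, 150)
print(round(theta, 4), round(psi, 3))   # about 0.0572 and 2.538
```

The odds ratio of about 2.538 agrees with the power = .80, α = .05 entry of Output 9.6 below.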


Program 9.2

/* Power for the null situation where margins are known a priori
   in a 2x2 table.

   INVOKE this macro as %nullpow(n=100, p1=.5, p2=.5);
   INPUT:
      n     = known sample size, assumed to be 100 if missing
      p1,p2 = marginal probabilities, set to 1/2 if missing
   OUTPUT variables:
      power     = values .25 to .9 by .05 and .9 to .99 by .01
      psi1-psi3 = odds ratios that can be tested with specified
                  power and alpha=.01, .05, and .10, respectively. */

%macro nullpow(n=100, p1=.5, p2=.5);
   %let nsig=3;                  /* number of significance levels */
   array alpha(&nsig) alpha1-alpha&nsig;
   array theta(&nsig) theta1-theta&nsig;
   array psi(&nsig) psi1-psi&nsig;
   alpha(1)=.01;                 /* specify significance levels */
   alpha(2)=.05;
   alpha(3)=.10;
   m= 1 / (&p1*&p2) + 1 / (&p1*(1-&p2)) + 1 / ((1-&p1)*&p2)
      + 1 / ((1-&p1)*(1-&p2));   /* multiplier in Equation 9.10 */
   do j=1 to 23;                 /* range for power */
      if j<15 then power=.2 + j/20;        /* power is .25 by .05 to .9 */
      if j>=15 then power=.9+.01*(j-14);   /* and .9 by .01 to .99 */
      do k=1 to &nsig;           /* for every different significance level */
         t=cnonct(cinv(1-alpha(k),1),1,1-power);    /* 1 df crit val */
         theta(k)=sqrt(t/(m*&n));                   /* theta in Tables 9.8, 9.9 */
         psi(k)=((1+4*theta(k))/(1-4*theta(k)))**2; /* odds ratio */
      end;
      output;                    /* every level of power on a different line */
   end;
%mend nullpow;

title1 'Approximate odds ratios for chi-squared tests';
title2 'Significance levels are .01, .05, and .10';
data initial;
   %nullpow(n=150, p1=.5, p2=.5);   /* specify N and marginal prob's */
run;
proc print noobs;
   var power psi1 psi2 psi3;
run;

As an illustration of the use of Program 9.2, consider a study in which N = 150 observations are the maximum you can afford. The SAS program

data subseq;
   %nullpow(n=150, p1=.5, p2=.5);
run;
proc print noobs;
   var power psi1 psi2 psi3;
run;

invokes the macro with a sample of this size and specifies the marginal probabilities of Table 9.8. The %NULLPOW macro produces a data set containing values of θ and ψ for different significance levels and powers. Printing the values of ψ produces Output 9.6.


Output 9.6 Approximate odds ratios for chi-squared tests
Significance levels are .01, .05, and .10

   power      psi1       psi2       psi3

    0.25    1.87018    1.52314    1.36718
    0.30    1.96672    1.60110    1.43935
    0.35    2.06101    1.67691    1.50856
    0.40    2.15505    1.75229    1.57680
    0.45    2.25052    1.82861    1.64556
    0.50    2.34904    1.90720    1.71610
    0.55    2.45235    1.98942    1.78972
    0.60    2.56251    2.07690    1.86790
    0.65    2.68221    2.17175    1.95252
    0.70    2.81519    2.27686    2.04615
    0.75    2.96709    2.39662    2.15267
    0.80    3.14727    2.53824    2.27846
    0.85    3.37324    2.71523    2.43537
    0.90    3.68452    2.95793    2.65009
    0.91    3.76460    3.02016    2.70507
    0.92    3.85390    3.08946    2.76624
    0.93    3.95494    3.16775    2.83532
    0.94    4.07146    3.25788    2.91477
    0.95    4.20928    3.36427    3.00847
    0.96    4.37825    3.49440    3.12295
    0.97    4.59703    3.66239    3.27055
    0.98    4.90823    3.90040    3.47929
    0.99    5.45122    4.31314    3.84028

A quick examination of Output 9.6 shows that a sample of size N = 150 has 80% power to detect an odds ratio of 2.5 or greater with significance level .05. Greater power or a smaller significance level requires a larger odds ratio. Relaxing the significance level to .1 from .05 enables you to detect odds ratios of about 2.25 with the same power. Similarly, a significance level of .01 requires an odds ratio greater than 3 in order for you to attain a power of 80%.

These estimates for power and sample size are based on the sample size alone. Output 9.6 was produced using no knowledge of the problem, the design of the study, or the outcome measures. As a consequence, these estimates are generally conservative in nature. That is to say, if you knew just a little bit more about the problem or, perhaps, were willing to commit yourself, then there would be some gain in power.


If you knew (or could provide an educated guess at) the marginal probabilities of the two binary-valued variables that make up the 2 × 2 table, then the hypotheses could be expressed using the notation of Table 9.9. Table 9.9 enables you to describe the more general hypotheses in terms of the marginal probabilities p1 and p2.

TABLE 9.9 Null and alternative hypotheses for a 2 × 2 table when the marginal totals can be approximated by probabilities p1 and p2.

                                                        Totals
  Null          p1 p2             p1(1 − p2)              p1
  hypothesis    (1 − p1)p2        (1 − p1)(1 − p2)        1 − p1
  Totals        p2                1 − p2                  1.0

                                                        Totals
  Alternative   p1 p2 + θ         p1(1 − p2) − θ          p1
  hypothesis    (1 − p1)p2 − θ    (1 − p1)(1 − p2) + θ    1 − p1
  Totals        p2                1 − p2                  1.0

The non-centrality parameter for testing the two hypotheses in Table 9.9 is

λ = Nθ² {1/(p1 p2) + 1/(p1(1 − p2)) + 1/((1 − p1)p2) + 1/((1 − p1)(1 − p2))} .    (9.10)

The non-centrality parameter in Equation 9.10 is always at least as large as that given in Equation 9.9. Specifically, these two values of λ will only agree when p1 = p2 = .5. The larger non-centrality parameter in Equation 9.10, and the subsequent greater power, are partly due to the specification of marginal probability parameters p1 and p2 in Table 9.9.
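To see where the extra power comes from, compare the braced multiplier of Equation 9.10 for different margins. In the sketch below (Python for illustration; not the book's code), λ = 7.849 is the usual 1-df non-centrality for α = .05 and power .80, and the θ-to-ψ mapping follows the one used in Program 9.2:

```python
import math

def multiplier(p1, p2):
    """The braced term of Equation 9.10; it is minimized, at 16,
    when p1 = p2 = 1/2, which recovers Equation 9.9."""
    return (1/(p1*p2) + 1/(p1*(1-p2))
            + 1/((1-p1)*p2) + 1/((1-p1)*(1-p2)))

def detectable_psi(lam, n, p1, p2):
    """Odds ratio psi > 1 detectable at non-centrality lam, using the
    theta -> psi mapping of Program 9.2."""
    theta = math.sqrt(lam / (multiplier(p1, p2) * n))
    return ((1 + 4*theta) / (1 - 4*theta)) ** 2

print(multiplier(.5, .5))                    # 16.0: reduces to Equation 9.9
print(round(multiplier(.25, 1/3), 6))        # 24.0: larger, hence more power
print(round(detectable_psi(7.849, 150, .5, .5), 3))   # ~2.538, cf. Output 9.6
```

A larger multiplier means a larger λ for the same θ, so a smaller odds ratio becomes detectable at the same sample size.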

As a numerical example, suppose you are willing to believe that p1 = 1/4 and p2 = 1/3. You could invoke the macro of Program 9.2 using

data subseq;
   %nullpow(n=150, p1=.25, p2=.333);
run;
proc print noobs;
   var power psi1 psi2 psi3;
run;

Output 9.7 is produced using this code. An odds ratio of 2.1 should have power 80% at significance level .05 and a sample of size 150. Compare this output to Output 9.6, where you saw that an odds ratio of 2.5 was needed in order to achieve the same level of power and sample size.


Output 9.7 Approximate odds ratios for chi-squared tests
Significance levels are .01, .05, and .10

   power      psi1       psi2       psi3

    0.25    1.66273    1.40813    1.28986
    0.30    1.73167    1.46631    1.34492
    0.35    1.79828    1.52232    1.39719
    0.40    1.86404    1.57747    1.44824
    0.45    1.93013    1.63281    1.49922
    0.50    1.99765    1.68926    1.55106
    0.55    2.06775    1.74777    1.60467
    0.60    2.14173    1.80945    1.66109
    0.65    2.22124    1.87565    1.72158
    0.70    2.30857    1.94825    1.78783
    0.75    2.40708    2.03002    1.86239
    0.80    2.52229    2.12550    1.94935
    0.85    2.66441    2.24302    2.05628
    0.90    2.85614    2.40117    2.19999
    0.91    2.90474    2.44119    2.23633
    0.92    2.95861    2.48551    2.27655
    0.93    3.01915    2.53528    2.32170
    0.94    3.08843    2.59218    2.37329
    0.95    3.16967    2.65882    2.43369
    0.96    3.26823    2.73957    2.50684
    0.97    3.39426    2.84265    2.60015
    0.98    3.57052    2.98653    2.73024
    0.99    3.87025    3.23038    2.95041

A small gain in power, but a gain nonetheless, is attained if you have some knowledge of the problem you are studying. Of course, this analysis fails to answer the more important question of whether an odds ratio of 2.1 has any practical significance. Two brief hypothetical examples illustrate this.

If you learn that carrying metal keys in your pocket doubles your chances of getting hit by lightning, then your behavior is unchanged. The probability of a lightning strike is so small that twice that amount is still very small. Besides, your keys are important to have handy at all times.

If, on the other hand, you are told that living in proximity to strong electrical fields doubles your chances of developing certain cancers, then your fear of these diseases might change your behavior. It might even induce you to move elsewhere. In both of these hypothetical situations, there might be similar statistical significance in the findings, but the practical implications are very different.


Appendix A
The Output Delivery System

The Output Delivery System (ODS) provides great flexibility in the control of output from SAS procedures. It is not specific to the GENMOD procedure. The parts that are specific to the SAS Version 8 GENMOD procedure are explained below. ODS replaces the MAKE option that enabled you to create SAS data sets from sections of the output of a procedure. ODS has the additional ability to capture output and reformat it in HTML. ODS is an area of active development at SAS, so expect to see changes in many of the items discussed here as the software develops.

The following example uses ODS to create a new SAS data set from the OBSTATS output of the GENMOD procedure. Without ODS EXCLUDE, the OBSTATS option in GENMOD produces the large number of observational statistics, including fitted values and residuals, that is usually associated with the use of OBSTATS. Some of these statistics could be very useful in a later procedure, such as plotting the residuals produced by OBSTATS. In this example, however, they are not needed.

proc genmod data=mydat;
   model freq = cov1 cov2 cov3 / dist=Poisson obstats;
   ods output obstats = fit;       /* create obstats data set   */
   ods listing exclude obstats;    /* remove the obstats listing */
run;
proc print data=fit;
   var cov1 pred;
run;

ODS OUTPUT in this example identifies the OBSTATS text as output. This material is written to a SAS data set called FIT in this example, but it could be given any other valid data set name. Notice that the newly created FIT data set contains both a covariate COV1 from the original MYDAT data as well as the predicted value PRED that is created by GENMOD.

A listing of all of the separate files that the GENMOD output can capture through the use of ODS is given in the LOG file of the program:

ods trace on;
proc genmod;
   . . . more statements . . .
run;
ods trace off;

If you use the statement ODS TRACE ON / LISTING, then ODS places labels in your Output window, immediately preceding the items they identify. This is handy if you want to see exactly what is covered by each item. Do not forget the ODS TRACE OFF statement, or you will continue to see these labels in places where you might not want them.


Table A.1 provides a list of some of the many useful items available through ODS in the GENMOD procedure for SAS Version 8. Future versions might include different information or omit some of the items listed here. Each item will have individual variables that you can examine and treat as any other data.

TABLE A.1 A list of some of the items that are available in the GENMOD procedure through ODS.

Item name            Description

ClassLevels          Parameter assignments for use with the CLASS statement
ConvergenceStatus    Whether or not the model fitting converged
CorrB                Correlation matrix for the estimated coefficients
CovB                 Covariance matrix of the estimated coefficients
IterLRCI             Convergence history for the LRCI option
IterParms            Convergence history for parameter estimates
IterType3            Details of the iterations to compute Type III tests
LastGrad             Last evaluation of the gradient of the likelihood
LastHess             Last evaluation of the Hessian of the likelihood
LRCI                 Likelihood ratio confidence intervals for parameters
ModelFit             Deviance and chi-squared goodness of fit statistics
ModelInfo            Names of variables in the model
ObStats              Observed data, covariates, fitted parameters, goodness
                     of fit diagnostics, and residuals
ParameterEstimates   Parameter estimates and Wald test statistics
Type1                Type I likelihood ratio measures of statistical
                     significance for model parameters
Type3                Type III measures of statistical significance
Type3W               The Type III Wald statistics
Type3LR              The Type III likelihood ratio statistics

All tables of SAS output in this book are created using ODS to build PostScript files. These files were created using SAS code of the form

ods printer file='c:\mydir\Table.ps' ps;
ods printer select ModelFit;
ods listing select ModelFit;
proc genmod;
   model y = cov1 cov2;
run;
ods printer close;
ods listing close;

In this example, all of the SAS output is omitted except for ModelFit. The ModelFit output includes estimates of the fitted parameters, and their Wald tests and confidence intervals. This material appears in the Output window of the SAS session. A physical file called Table.ps is created that contains this text in PostScript format.

PostScript text can be continuously stretched to fit any specified size. An example is given in Output A.1. In this table, the PostScript file from Output 2.1 is displayed in three different sizes.


Output A.1 (the PostScript table from Output 2.1, displayed in the book at three different sizes; the content is identical each time):

X variables
The GENMOD Procedure

Observation Statistics

                  mice with   exposure
  Observation       tumors     status
       1               1          1
       2               1          0
       3               0          1
       4               0          0


Appendix B
Programming Statements for Generalized Linear Models

The GENMOD procedure is based on an implementation of the generalized linear model theory popularized by McCullagh and Nelder (1989). This appendix is not a substitute for the SAS documentation on GENMOD, but rather presents some techniques that you might find helpful in fitting additional generalized linear models.

There is no reason to restrict the use of GENMOD to problems of discrete data as covered in this book. The generalized linear model methods apply to a wide variety of settings that include continuous as well as discrete outcome data. The flexibility of the GENMOD procedure allows for the development of such models as truncated Poisson regression (Chapter 7). The extended hypergeometric regression (Chapter 8) of this book is not a generalized linear model, but it can be fitted using GENMOD.

The methodologic setting for generalized linear models is that a random variable y has mean µ and variance σ². There is a function g(), referred to as the link function, such that

g(µ) = x′β

for a vector of covariates x. That is, there is some function g of the mean that is expressible as a linear function of covariates in the data. The variance σ² of y could also be specified as a function of covariates.

The generalized linear models also need a way of getting back the mean µ of y from the linear function of covariates x′β. This function is denoted by g⁻¹ and is referred to as the inverse link. Mathematically,

µ = g⁻¹(x′β).

The value of the likelihood of y is needed along with the maximum likelihood estimate of µ. For any observed value of y, an unconstrained maximum likelihood estimate of µ is needed and is denoted by µ̂. The value of µ̂ is the value of the mean parameter of y that maximizes the log of the probability, or log-likelihood

l(y | µ̂) = log Pr[y | µ̂].

The model specifies that g(µ) is expressible as a linear function of covariates x, but there is no such restriction placed on µ̂. In other words, l(y | µ̂) is the largest value that the likelihood can attain, over all possible values of the µ parameter. In general, µ̂ is not a linear function of x.

The deviance is twice the difference

deviance(µ) = 2{l(µ̂) − l(µ)}

between the likelihood at its maximum and its value at some other parameter value µ. The deviance is never negative.

A wide variety of linear models can be specified using all of the methods described so far. Many of these are available as options in GENMOD without the need for further programming. Examples for discrete valued data include such useful models as log-linear models, logistic regression, and Poisson regression. Specifying such functions as the link and deviance allows for an even wider set of models, such as truncated Poisson regression and hypergeometric regression.


GENMOD provides a set of programming variables that facilitate the development of new models. The response variable is specified as _RESP_, the mean is _MEAN_, and the linear estimate is _XBETA_. From these programming variables you have to calculate a variance, the link function of _MEAN_, the inverse link of _XBETA_, and the deviance evaluated at _MEAN_.

An example of the programming steps needed for the Poisson distribution is given in the SAS GENMOD documentation (1999, pp. 1370–75). Programming the Poisson distribution is not necessary, but it illustrates its place in the generalized linear model framework. The programming steps are

proc genmod;
   variance var = _MEAN_;
   deviance dev = 2*(_RESP_*log(_RESP_/_MEAN_) + _MEAN_ - _RESP_);
   fwdlink link = log(_MEAN_);
   invlink ilink = exp(_XBETA_);

followed by a MODEL statement.

For the Poisson distribution, the variance is the same as the mean. In log-linear models, the log of the mean is expressible as a linear function of covariates. That is, the link function g is the logarithm. The inverse of this link function is the exponential function.

The Poisson log-likelihood is

l(µ) = log{exp(−µ) µ^y / y!}.

As discussed in Section 2.5, the y! term can be ignored because it does not contain any parameters that need to be estimated. The unconstrained maximum likelihood estimate of the Poisson parameter µ is the observed value y of the data. That is, for Poisson data, the value µ̂ = y maximizes this likelihood.

The deviance for the Poisson model is then

deviance(µ) = 2{y log(y/µ) − y + µ}.

This function of the mean (µ = _MEAN_) and the response variable (y = _RESP_) is specified in the above example from the GENMOD manual.
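The Poisson deviance can also be evaluated directly to confirm that it vanishes at the unconstrained estimate µ̂ = y and is positive elsewhere. A small Python sketch for illustration (not part of the GENMOD code):

```python
import math

def poisson_deviance(y, mu):
    """Poisson deviance contribution 2*{y log(y/mu) - y + mu};
    the usual convention sets the y log(y/mu) term to 0 when y = 0."""
    term = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (term - y + mu)

print(poisson_deviance(4, 4.0))              # 0.0 at the MLE mu-hat = y
print(round(poisson_deviance(4, 2.5), 2))    # positive away from the MLE
```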

A second example with a more complicated distribution is described in Chapter 7. The truncated Poisson distribution, defined at Equation 7.1, has probability mass function

P[Y = y | Y > 0] = e^(−λ) λ^y / {y! (1 − e^(−λ))}    (B.1)

defined for y = 1, 2, . . . and parameter λ > 0. The GENMOD macros to fit this distribution are given in Program 7.1 and copied here as Program B.1.
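Before turning to the SAS macros, the truncated mass function itself can be checked numerically: its probabilities over y = 1, 2, . . . should sum to one. A Python sketch for illustration only:

```python
import math

def truncated_poisson_pmf(y, lam):
    """P[Y = y | Y > 0] of Equation B.1: a Poisson mass rescaled
    by 1/(1 - exp(-lam)) to remove the zero class."""
    return (math.exp(-lam) * lam ** y
            / (math.factorial(y) * (1.0 - math.exp(-lam))))

# The rescaled masses sum to one (summed until the tail is negligible)
total = sum(truncated_poisson_pmf(y, 2.0) for y in range(1, 60))
print(round(total, 10))   # 1.0
```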

The following program creates the %TPR and %INV macros given in Program 7.1 for fitting the truncated Poisson distribution in GENMOD.

Program B.1

%macro tpr;                       /* Truncated Poisson regression */
   y=_RESP_;                      /* Observed count */
   lambda=exp(_XBETA_);           /* Truncated Poisson parameter */
   eml=exp(-lambda);              /* e to the minus lambda */

   /* Truncated Poisson deviance */
   dev=0;                         /* Deviance for y=1 */
   if y > 1 then do;
      %inv(y);                    /* lam is MLE for y > 1 */
      dev= -lam + y*log(lam) - log(1-exp(-lam));   /* deviance */
   end;
   dev = dev + lambda - y*log(lambda) + log(1-eml);
   deviance d = 2*dev;            /* times 2 for chi-squared distribution */

   /* Provide GENMOD with link functions and variance */
   %inv(_MEAN_);                  /* Find lambda for _MEAN_ */
   fwdlink ey = log(lam);         /* Link function: log lambda */
   invlink linv = lambda / (1-eml);   /* Truncated Poisson mean */

   /* Truncated Poisson variance */
   variance var = (lambda*(lambda+1)*(1-eml)-lambda**2)/(1-eml)**2;
%mend tpr;

%macro inv(ev);
   /* Interval bisection to find lambda from the expected value (ev) */
   if &ev LE 1 then               /* expected value must be > 1 */
      lam= . ;                    /* lambda is not defined for ev LE 1 */
   else do;                       /* otherwise iterate to find lambda */
      lamlo=&ev-1;                /* lambda is between exp value less one */
      lamhi=&ev;                  /* ...and the expected value */
      do until (abs(lamhi-lamlo)<1e-7);   /* convergence criterion */
         lam=(lamhi+lamlo)/2;     /* examine midpoint */
         mal= lam/(1-exp(-lam));  /* mean at midpoint */
         if mal GE &ev then lamhi=lam;    /* lower upper endpoint */
         if mal LE &ev then lamlo=lam;    /* raise lower endpoint */
      end;
   end;
%mend inv;
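The %INV interval bisection carries over almost line for line to other languages. The sketch below (Python for illustration; the test value 2.313 is an arbitrary mean, not from the book) uses the same starting interval (ev − 1, ev), which is valid because the truncated Poisson mean µ(λ) = λ/(1 − e^(−λ)) always lies strictly between λ and λ + 1:

```python
import math

def lambda_from_mean(ev, tol=1e-7):
    """Interval bisection mirroring the %INV macro: find lambda > 0
    with ev = lambda / (1 - exp(-lambda)). Returns None for ev <= 1,
    where no solution exists."""
    if ev <= 1:
        return None                  # mean of the truncated Poisson exceeds 1
    lo, hi = ev - 1.0, ev            # bracketing interval for lambda
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mid / (1.0 - math.exp(-mid)) >= ev:
            hi = mid                 # mean at midpoint too large: lower hi
        else:
            lo = mid                 # mean at midpoint too small: raise lo
    return (lo + hi) / 2.0

lam = lambda_from_mean(2.313)
print(round(lam / (1.0 - math.exp(-lam)), 6))   # recovers 2.313
```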

The generalized linear model developed in Chapter 7 around this distribution specifies that λ is positive and that its logarithm is a linear function of the covariates X_j. That is,

λ = exp{α + β1 X1 + β2 X2 + · · ·},

which can never be negative regardless of the values of the X's or the regression coefficients α, β1, β2, . . . .

The expected value of the truncated Poisson distribution in Equation B.1 is

µ = µ(λ) = λ/(1 − e^(−λ)).    (B.2)

This function of the mean is also the inverse link g⁻¹. Specifically, starting with the value of the linear predictor x′β = _XBETA_, the macro in Program B.1 calculates

λ = exp(x′β).

And from this value of λ, the inverse link is computed from Equation B.2.

The variance of the truncated Poisson distribution is

Var Y = {λ(λ + 1)(1 − e^(−λ)) − λ²}/(1 − e^(−λ))².

This variance is also specified as a function of λ in the macro of Program B.1.


To find the deviance, you need to be able to calculate the forward link g. The forward link is needed for the deviance because g(y) is the value of the λ parameter that maximizes the probability (Equation B.1) and, hence, the log-likelihood.

The forward link g(µ) is found by means of an iterative procedure in the macro %INV. Beginning with the expected value, the value of the λ parameter can be obtained. Even though the function g(µ) does not have a simple, closed expression, it is still possible to fit the truncated Poisson model through the use of an iterative approximation in GENMOD.


Appendix C
Additional Readings

For more general and elementary examples, consult Allison (1999) and Stokes, Davis, and Koch (2001). These books describe other, more general techniques, such as logistic regression, and applications that use SAS procedures such as CATMOD, FREQ, REG, RANK, and LOGISTIC. The Stokes et al. volume is a good resource for those who do a lot of statistical modeling. Their book is full of examples from a wide variety of commonly encountered settings. Unless your data is very unusual, there is probably an example in Stokes et al. that looks similar to your problem, including the SAS code to perform the analysis. Allison provides a detailed description of logistic regression, which includes many examples and diagnostics.

Zelterman (1999) is a little more advanced and contains examples, SAS programs, and more theory than Stokes et al. Chapter 5 describes coordinate-free methods that are the theoretical basis for building your own design matrices. Zelterman covers other issues, such as sample size estimation for planning studies, and describes methods for sparse data. Zelterman is available from SAS Publications.

Agresti (1990) is much more advanced and covers a broader range of topics than any of the references mentioned above. Agresti provides a wide range of models and methods and contains a huge number of references. This is a standard reference volume that is used in many PhD-level university courses. It is a more theoretical book and contains only a few programs, SAS or otherwise. Agresti is often the starting point for developing more advanced statistical methods.

There are two other advanced references that you should be aware of. Johnson, Kotz, and Kemp (1992) provide a thorough description of the mathematical properties of all discrete distributions that are in use today. This book contains extensive references to commonly used discrete distributions, such as the Poisson and binomial, but also to those distributions that are infrequently used and obscure.

The other important advanced reference you should be aware of is the book by McCullagh and Nelder (1989). The GENMOD procedure in SAS is an implementation of the generalized linear model methodology popularized by McCullagh and Nelder. This book describes only a small portion of the capabilities of PROC GENMOD because it is limited to discrete-valued data. McCullagh and Nelder develop generalized linear models in the widest possible sense and, as a result, provide a vast array of statistical methods and techniques for modeling data.


References

Agresti, A. 1990. Categorical Data Analysis. New York: John Wiley.

Agresti, A. and Coull, B. A. 1996. Order-restricted tests for stratified comparisons of binomial proportions. Biometrics 52:1103–11.

Agresti, A. and Coull, B. A. 1998. Order-restricted inference for monotone trend alternatives in contingency tables. Computational Statistics & Data Analysis 28:139–55.

Albert, P. S. and McShane, L. M. 1995. A generalized estimating equations approach for spatially correlated binary data: Applications to the analysis of neuroimaging data. Biometrics 51:627–38.

Allison, P. D. 1999. Logistic Regression Using the SAS System: Theory and Applications. Cary, N.C.: SAS Institute Inc.

Altham, P. M. E. 1969. Exact Bayesian analysis of a 2 × 2 contingency table, and Fisher's "exact" significance test. Journal of the Royal Statistical Society Series B 31:261–9.

Andrews, D. F. and Herzberg, A. M. 1985. Data. New York: Springer-Verlag.

Ashford, J. R. and Sowden, R. R. 1970. Multivariate probit analysis. Biometrics 26:535–46.

Bishop, Y. M. M. 1969. Full contingency tables, logits, and split contingency tables. Biometrics 25:283–99.

Bishop, Y. M. M. and Fienberg, S. E. 1969. Incomplete two-dimensional contingency tables. Biometrics 25:119–28.

Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. 1975. Discrete Multivariate Analysis. Cambridge, Mass.: MIT Press.

Bradley, R. A. and Terry, M. E. 1952. Rank analysis of incomplete block designs I. The method of paired comparisons. Biometrika 39:324–45.

Breslow, N. E. 1984. Extra-Poisson variation in log-linear models. Applied Statistics 33:38–44.

Cantor, A. 1997. Extending SAS Survival Analysis Techniques for Medical Research. Cary, N.C.: SAS Institute Inc.

Cochran, W. G. 1954. Some methods for strengthening common χ² tests. Biometrics 10:417–51.

Cody, R. P. and Smith, J. K. 1997. Applied Statistics and the SAS Programming Language. 4th ed. Upper Saddle River, N.J.: Prentice Hall.

Collett, D. 1991. Modeling Binary Data. London: Chapman & Hall.

Darroch, J. N., Lauritzen, S. L., and Speed, T. P. 1980. Markov fields and log-linear interaction models for contingency tables. Annals of Statistics 8:522–39.

Delwiche, L. D. and Slaughter, S. J. 1998. The Little SAS Book. 2nd ed. Cary, N.C.: SAS Institute Inc.

Farmer, J. H., Kodell, R. L., Greenman, D. L., and Shaw, G. W. 1979. Dose and time response models for the incidence of bladder and liver neoplasms in mice fed 2-Acetylaminofluorene continuously. Journal of Environmental Pathology and Toxicology 3:55–68.

Greenwood, R. J., Sargent, A. B., and Johnson, D. H. 1985. Evaluation of mark-recapture for estimating striped skunk Mephitis mephitis abundance. Journal of Wildlife Management 49:332–40.

Hewlett, P. S. and Plackett, R. L. 1950. Statistical aspects of the independent joint action of poisons, particularly insecticides, II. Examination of data for agreement with the hypothesis. Annals of Applied Biology 37:527–52.

Hook, E. B., Albright, S. G., and Cross, P. K. 1980. Use of Bernoulli census and log-linear methods for estimating the prevalence of spina bifida in live births and the completeness of vital records in New York State. American Journal of Epidemiology 112:750–8.

Innes, J. R. M., Ulland, B. M., Valerio, M. G., Petrucelli, L., Fishbein, L., Hart, E. R., Pallotta, A. J., Bates, R. R., Falk, H. L., Gart, J. J., Klein, M., Mitchell, I., and Peters, J. 1969. Bioassay of pesticides and industrial chemicals for tumorigenicity in mice: a preliminary note. Journal of the National Cancer Institute 42:1101–14.

Johnson, N. L., Kotz, S., and Kemp, A. W. 1992. Univariate Discrete Distributions. 2nd ed. New York: John Wiley & Sons, Inc.

Johnson, M. P. and Raven, P. H. 1973. Species number and endemism: The Galapagos Archipelago revisited. Science 179:893–5.

Kastenbaum, M. A. and Lamphiear, D. E. 1959. Calculation of chi-square to test the no three-factor interaction hypothesis. Biometrics 15:107–15.

Lambert, D. and Roeder, K. 1995. Overdispersion diagnostics for generalized linear models. Journal of the American Statistical Association 90:1225–36.

Lauritzen, S. L. 1996. Graphical Models. Oxford: Oxford University Press.

Lee, J. A. H., Hitosugi, M., and Peterson, G. R. 1973. Rise in mortality from tumors of the testis in Japan, 1947–70. Journal of the National Cancer Institute 51:1485–90.

Mantel, N. and Brown, C. 1973. A logistic reanalysis of Ashford and Sowden's data on respiratory symptoms in British coalminers. Biometrics 29:649–65.

McCullagh, P. and Nelder, J. A. 1989. Generalized Linear Models. 2nd ed. London: Chapman and Hall.

National Center for Health Statistics. 1990. Vital Statistics of the United States, 1988, vol. II, Mortality, Part B. Department of Health and Human Services Publication Number (PHS) 90-1102. Washington, D.C.: U.S. Government Printing Office.

Paul, S. R. 1982. Analysis of proportions of affected foetuses in teratological experiments. Biometrics38:361–70.

Plackett, R. L. 1965. A class of bivariate distributions. Journal of the American Statistical Association60:516–22.

Plackett, R. L. 1977. The marginal totals of a 2 × 2 table. Biometrika 64:37–42.

Plackett, R. L. 1981. The Analysis of Categorical Data. 2nd ed. London: Griffin.

SAS Institute Inc. 1999. SAS/STAT User’s Guide, Version 8. Cary, N.C.: SAS Institute Inc.

Shapiro, H., Slone, D., Rosenberg, L., Kaufman, D. W., Stolley, P. D., Miettinen, O. S. 1979. Oral-contraceptive use in relation to myocardial infarction. Lancet 1(8119):743–7.

Stokes, M. E., Davis, C. S., and Koch, G. G. 2001. Categorical Data Analysis Using the SAS System.2nd ed. Cary, N.C.: SAS Institute Inc.

van der Heijden, P. G. M., Zelterman, D., Engbersen, G. B. M., and van der Leun, J. 1997. Esti-mating the number of illegals in the Netherlands with the truncated Poisson regression model.Unpublished abstract, University of Utrecht.

Waller, L. A. and Zelterman, D. 1997. Log-linear modeling with the negative multinomial distri-bution. Biometrics 53:971–82.

Wermuth, N. 1976. Exploratory analyses of multidimensional contingency tables. Proceedings of the9th International Biometrics Conference. V.I.:279–95.

Williams, D. A. 1982. Extra-binomial variation in logistic linear models. Applied Statistics 31:144–8.

Wittes, J. T. 1970. Estimation of population size: The Bernoulli census. PhD diss., Harvard Univer-sity.

Worchester, J. 1971. The relative odds on the 23 contingency table. American Journal of Epidemiol-ogy 93:145–9.

Zelterman, D. 1999. Models for Discrete Data. Oxford: Oxford University Press.

Zelterman, D. 1987. Goodness-of-fit tests for large sparse multinomial distributions. Journal of theAmerican Statistical Association 82:624–9.

Zelterman, D., Chan, I. S. F., and Mielke, P. W. 1995. Exact tests of significance in higher dimensionaltables. American Statistician 49:357–61.


Index

B
baseball win/loss records (example) 79–84
beetle mortality (example) 4–8
Bernoulli trials 2
binomial distribution 2–8
  beetle mortality (example) 4–8
  linear models for 4–8
  negative binomial distribution 12–14, 97–100, 121
  Poisson distribution, connection to 10
Bradley-Terry model for pairwise comparisons 79–84
brain lesions (example) 72–79

C
cancer deaths in Ohio (example) 13–17
capture-recapture samples 101
  skunk recapture (example) 111–119
CATMOD procedure, WLS option 45
central chi-squared distribution 150
chi-squared statistic 15–17
  central chi-squared distribution 150
  non-central chi-squared distribution 150, 152, 155, 157, 159
  non-centrality parameter 152, 155, 157–159
  Pearson chi-squared statistic 22, 39, 152, 155
  Wald chi-squared statistic 26, 43
CHIPOW macro 153–154, 157
circular tables 72–79
CLASS statement, GENMOD procedure 26
CLASS variables, interaction of 59–67
  finite population size estimation 101–109
coal miners (example) 53–59
common interaction effect 65
comprehensive models 37
confidence intervals
  extended hypergeometric distribution 138
  log-likelihood function 51–52
confidence limits for estimated means 25
congenital deformity in infants (example) 105–109
continuity correction 135
covariates, truncated Poisson models with 117–119, 121
cross-classification of ordered categories 59–67
  finite population size estimation 101–109

D
deviance residual 39
  standardized 40
deviance statistic 23, 51, 175
  non-centrality parameter 152, 157, 159
discrete distributions
  binomial distribution 2–8, 10
  multinomial distribution 11–12
  negative binomial distribution 12–14, 97–99, 121
  negative multinomial distribution 13–17
  Poisson distribution 8–10, 11, 176
distribution index 2
diversity of species (example) 92–100

E
effect coding 28
estimated means, confidence limits for 25
EXACT statement, FREQ procedure 134
exact test 131, 134
extended hypergeometric distribution 136–139, 147–148

F
factorial data 69
finite population size estimation 101–109
Fisher's exact test 131, 134
FITHYP macro 142, 144
FREQ procedure 129–131, 133–136
  EXACT statement 134
  extended hypergeometric distribution 138–139
fungicide exposure in mice (example)
  extended hypergeometric distribution 137–138
  hypergeometric distribution 129–131, 133
  hypergeometric regression 144–148
  log-linear models for 2x2 tables 19–30
  power for 2x2 tables 154–161

G
Galápagos diversity of species (example) 92–100
generalized linear models
  log-linear models versus 1
  programming statements for 175–178
GENMOD procedure
  See also MODEL statement, GENMOD procedure
  CLASS statement 26
  extended hypergeometric distribution 147–148
  finite population size estimation 101–109
  higher-dimensional log-linear models 30–38
  hypergeometric regression 140–143, 144–147
  likelihood function 45–52
  linear models for binomial distribution 4–8
  linear models for Poisson distribution 8–10, 176
  log-linear models for 2x2 tables 21–30
  logistic regression model 7–8
  ODS statements 22, 173–175
  OPTIONS statement 22
  Poisson regression for mortality data 87–92
  Poisson regression with overdispersion 92–100
  triangular tables 69–72
  truncated Poisson regression 112–119, 176

H
head trauma (example) 59–67
Hessian weights 25
higher-dimensional log-linear models 30–38
  finite population size estimation 105–109
hypergeometric distribution 3–4, 129–148
  derivation of 131–136
  extended 136–139, 147–148
  tail area of 134–136
hypergeometric regression 139–144
  comparing several 2x2 tables 144–148
  MODEL statement, GENMOD procedure 142
HYPINV macro 138
hypothesis testing, log-likelihood function 51–52

I
illegal immigration in Netherlands (example) 119–121
independence in triangular tables 69–72
independence model, parameterizing 25–29
index, distribution 2
infants
  mortality and prenatal care (example) 161–166
  with congenital deformity (example) 105–109
  with spina bifida (example) 101–105
INITIAL= option, MODEL statement (GENMOD) 148
interaction effects 64–65
  circular tables 72–79
  linear by linear interaction 66
  sample size estimation 161–166
INTERCEPT= option, MODEL statement (GENMOD) 124, 148
INV macro 112–115, 176, 178
inverse link function 175

J
Japan, testicular cancer in (example) 85–92
jittering 77–79

L
least square estimation 6, 46
likelihood function 45–52
  hypothesis testing and confidence intervals 51–52
  lottery winners in New Haven (example) 48–50
  maximum likelihood 6, 45, 138
  variance of parameter estimate 47
likelihood ratio 51
likelihood residual, standardized 40
linear by linear interaction 66
linear models
  See also generalized linear models
  See also log-linear models
  for binomial distribution 4–8
  for Poisson distribution 8–10, 176
LINESIZE= option, OPTIONS statement 22
link function 1, 86, 175
LINK= option, MODEL statement (GENMOD) 8
log-likelihood function 45–52
  confidence intervals 51–52
  hypothesis testing 51–52
  Poisson log-likelihood 45
log-linear models
  finite population size estimation 101–109
  for 2x2 tables 19–30
  generalized linear models versus 1
  higher-dimensional 30–38, 105–109
  likelihood function 45–52
  nested 23, 43
  one ordered category 53–59
  parameterizing model of independence 25–29
  Poisson regression for mortality data 87–92
  Poisson regression with overdispersion 92–100
  power and sample size estimation 149–171
  residuals 25, 38–40, 91–92, 96–97, 99–100
  tests of statistical significance 40–44
  truncated Poisson regression 111–128
LOGISTIC procedure 8
  beetle mortality (example) 5
logistic regression 3–4
  beetle mortality (example) 5
  GENMOD procedure for 7–8
logit transformation 7
lottery winners in New Haven (example)
  likelihood function 48–50
  truncated Poisson regression 121–128
LRCI option, MODEL statement (GENMOD) 51–52

M
macros
  CHIPOW macro 153–154, 157
  FITHYP macro 142, 144
  HYPINV macro 138
  INV macro 112–115, 176, 178
  NULLPOW macro 168
  TPR macro 112–115, 123, 176
Mantel-Haenszel estimator 138
marginal significance 41–43
mark-recapture samples 101
  skunk recapture (example) 111–119
maximum likelihood 6, 45
  extended hypergeometric distribution 138
MAXIT= option, MODEL statement (GENMOD) 142, 144, 148
MAXITER= option, MODEL statement (GENMOD) 124
Mendel, Gregor 12
MODEL statement, GENMOD procedure
  @2, @3 options 56, 106
  hypergeometric regression 142
  INITIAL= option 148
  INTERCEPT= option 124, 148
  linear models for binomial distribution 4
  linear models for Poisson distribution 8–10
  LINK= option 8
  LRCI option 51–52
  MAXIT= option 142, 144, 148
  MAXITER= option 124
  OBSTATS option 5, 23–25, 38–40, 173
  OFFSET= option 125, 127
  SCALE= option 14, 96–97
mortality data, Poisson regression for 87–92
multinomial distribution 11–12
  negative multinomial distribution 13–17
  Poisson distribution, connection to 11
myocardial infarction and oral contraceptives (example) 139–144

N
NCP (non-centrality parameter)
  chi-squared statistic 152, 155, 157–159
  deviance statistic 152, 157, 159
  Pearson chi-squared statistic 152–153, 155
negative binomial distribution 12–14
  Poisson regression with overdispersion 97–99
  truncated Poisson regression with covariates 121
negative multinomial distribution 14–17
  cancer deaths in Ohio (example) 13–17
nested log-linear models 43
  deviance statistic 23
Netherlands, illegal immigration in (example) 119–121
NLIN procedure 47
non-central chi-squared distribution 150
  deviance statistic 152, 157, 159
  Pearson chi-squared statistic 152, 155
non-centrality parameter See NCP
non-rectangular tables 69–84
  Bradley-Terry model for pairwise comparisons 79–84
  circular tables 72–79
  triangular tables, independence in 69–72
nonhierarchical models 37
NULLPOW macro 168

O
observed information 47
observed zeros 69
OBSTATS option, MODEL statement (GENMOD) 5, 23–25, 38–40, 173
odds ratio 29–30
ODS statements, GENMOD procedure 22, 171–173
OFFSET= option, MODEL statement (GENMOD) 125, 127
OPTIONS statement, GENMOD procedure 22
  LINESIZE= option 22
  PAGESIZE= option 22
oral contraceptives and myocardial infarction (example) 139–144
ordered categorical variables 53–67
  finite population size estimation 101–109
  one category 53–59
  two cross-classified categories 59–67
Output Delivery System (ODS) 22, 173–175
overdispersion 15, 23
  Poisson regression with 92–100
  truncated Poisson regression with 119–121
overparameterized models 26

P
p-value 150
PAGESIZE= option, OPTIONS statement 22
pairwise comparisons, Bradley-Terry model for 79–84
parameter estimate, variance of 47
parameterizing the model of independence 25–29
partial association 44
Pearson chi-squared statistic 22, 39
  non-centrality parameter 152–153, 155
Pearson residual 39
  standardized 39–40
perinatal mortality (example) 31–38
  marginal significance 41–43
Plackett model 65
Poisson distribution 8–10, 176
  binomial distribution, connection to 10
  linear models for 8–10, 176
  multinomial distribution, connection to 11
Poisson log-likelihood 45
Poisson regression 85–100, 176
  mortality data 87–92
  overdispersion 92–100
  truncated 111–128, 176
population size estimation 101–109
power estimation 149–161
  for known sample size 167–171
  for 2x2 tables 154–161
prenatal care and infant mortality (example) 161–166
probit regression 8

R
raw residual 38–39
reference categories 26
regression
  hypergeometric regression 139–148
  logistic regression 3–4, 5, 7–8
  Poisson regression 85–100, 111–128, 176
  probit regression 8
  truncated Poisson regression 111–128, 176
residuals 25, 38–40
  deviance residual 39
  Pearson residual 39
  Poisson regression for mortality data 91–92
  Poisson regression with overdispersion 96–97, 99–100
  raw residuals 38–39
  standardized deviance residual 40
  standardized likelihood residual 40
  standardized Pearson residual 39–40
  standardized residuals 25

S
sample size 2
  estimation 161–166
  power estimation for known sample size 167–171
SCALE= option, MODEL statement (GENMOD) 14, 96–97
score function 47
severe head trauma (example) 59–67
significance levels 150
  tests of statistical significance 40–44
  Type 1 significance level 43–44, 92
  Type 3 significance level 44, 92
Simpson's paradox 41, 165
size of population, estimating 101–109
skunk recapture (example) 111–119
species diversity (example) 92–100
spina bifida, infants with (example) 101–105
standardized deviance residual 40
standardized likelihood residual 40
standardized Pearson residual 39–40
standardized residuals 25
stroke patients (example) 69–72
structural zeros 69
super-saturated models 26

T
tables
  circular tables 72–79
  independence in triangular tables 69–72
  non-rectangular tables 69–84
  2x2 tables 19–30, 144–148, 154–161
tail area of a hypergeometric distribution 134–136
testicular cancer in Japan (example) 85–92
tests of statistical significance 40–44
  marginal and Wald significance levels 41–43
  Type 1 and Type 3 significance levels 43–44
TPR macro 112–115, 123, 176
triangular tables, independence in 69–72
truncated Poisson regression 111–128, 176
  diagnostics and options 121–128
  models with covariates 117–119, 121
  overdispersion 119–121
tumors in mice (example)
  extended hypergeometric distribution 137–138
  hypergeometric distribution 129–131, 133
  hypergeometric regression 144–148
  log-linear models for 2x2 tables 19–30
  power for 2x2 table 154–161
two-by-two tables
  comparing using hypergeometric regression 144–148
  log-linear models for 19–30
  power estimation 154–161
Type 1 significance level 43–44
  Poisson regression for mortality data 92
Type 3 significance level 44
  Poisson regression for mortality data 92

V
variables
  CLASS variables 59–67, 101–109
  ordered categorical variables 53–67, 101–109
variance of parameter estimate 47

W
Wald chi-squared statistic 26, 43
Wald significance level 41–43
WLS option, CATMOD procedure 45

Y
Y-strain mice (example) 154–161

Special Characters
@2, @3 options, MODEL statement (GENMOD) 56, 106
2x2 tables
  comparing using hypergeometric regression 144–148
  log-linear models for 19–30
  power estimation 154–161


Call your local SAS office to order these books from Books by Users Press

Advanced Log-Linear Models Using SAS®

by Daniel Zelterman .................................Order No. A57496

Annotate: Simply the Basics
by Art Carpenter ............Order No. A57320

Applied Multivariate Statistics with SAS® Software, Second Edition
by Ravindra Khattree and Dayanand N. Naik ............Order No. A56903

Applied Statistics and the SAS® Programming Language, Fourth Edition
by Ronald P. Cody and Jeffrey K. Smith ............Order No. A55984

An Array of Challenges — Test Your SAS® Skills
by Robert Virgile ............Order No. A55625

Basic Statistics Using SAS®: Student Guide and Exercises
(books in this set also sold separately)
by Larry Hatcher ............Order No. A57541

Beyond the Obvious with SAS® Screen Control Language
by Don Stanley ............Order No. A55073

Carpenter's Complete Guide to the SAS® Macro Language
by Art Carpenter ............Order No. A56100

The Cartoon Guide to Statistics
by Larry Gonick and Woollcott Smith ............Order No. A55153

Categorical Data Analysis Using the SAS® System, Second Edition
by Maura E. Stokes, Charles S. Davis, and Gary G. Koch ............Order No. A57998

Client/Server Survival Guide, Third Edition
by Robert Orfali, Dan Harkey, and Jeri Edwards ............Order No. A58099

Cody's Data Cleaning Techniques Using SAS® Software
by Ron Cody ............Order No. A57198

Common Statistical Methods for Clinical Research with SAS® Examples, Second Edition
by Glenn A. Walker ............Order No. A58086

Concepts and Case Studies in Data Management
by William S. Calvert and J. Meimei Ma ............Order No. A55220

Debugging SAS® Programs: A Handbook of Tools and Techniques
by Michele M. Burlew ............Order No. A57743

Efficiency: Improving the Performance of Your SAS® Applications
by Robert Virgile ............Order No. A55960

A Handbook of Statistical Analyses Using SAS®, Second Edition
by B.S. Everitt and G. Der ............Order No. A58679

Health Care Data and the SAS® System
by Marge Scerbo, Craig Dickstein, and Alan Wilson ............Order No. A57638

The How-To Book for SAS/GRAPH® Software
by Thomas Miron ............Order No. A55203

In the Know ... SAS® Tips and Techniques From Around the Globe
by Phil Mason ............Order No. A55513

Integrating Results through Meta-Analytic Review Using SAS® Software
by Morgan C. Wang and Brad J. Bushman ............Order No. A55810

Learning SAS® in the Computer Lab, Second Edition
by Rebecca J. Elliott ............Order No. A57739

The Little SAS® Book: A Primer
by Lora D. Delwiche and Susan J. Slaughter ............Order No. A55200

The Little SAS® Book: A Primer, Second Edition
by Lora D. Delwiche and Susan J. Slaughter ............Order No. A56649
(updated to include Version 7 features)

Logistic Regression Using the SAS® System: Theory and Application
by Paul D. Allison ............Order No. A55770

Longitudinal Data and SAS®: A Programmer's Guide
by Ron Cody ............Order No. A58176

Maps Made Easy Using SAS®
by Mike Zdeb ............Order No. A57495

Models for Discrete Data
by Daniel Zelterman ............Order No. A57521

Multiple Comparisons and Multiple Tests Using the SAS® System Text and Workbook Set
(books in this set also sold separately)
by Peter H. Westfall, Randall D. Tobias, Dror Rom, Russell D. Wolfinger, and Yosef Hochberg ............Order No. A58274

www.sas.com/pubs


Multiple-Plot Displays: Simplified with Macros
by Perry Watts ............Order No. A58314

Multivariate Data Reduction and Discrimination with SAS® Software
by Ravindra Khattree and Dayanand N. Naik ............Order No. A56902

The Next Step: Integrating the Software Life Cycle with SAS® Programming
by Paul Gill ............Order No. A55697

Output Delivery System: The Basics
by Lauren E. Haworth ............Order No. A58087

Painless Windows: A Handbook for SAS® Users
by Jodie Gilmore ............Order No. A55769
(for Windows NT and Windows 95)

Painless Windows: A Handbook for SAS® Users, Second Edition
by Jodie Gilmore ............Order No. A56647
(updated to include Version 7 features)

PROC TABULATE by Example
by Lauren E. Haworth ............Order No. A56514

Professional SAS® Programmers Pocket Reference, Second Edition
by Rick Aster ............Order No. A56646

Professional SAS® Programmers Pocket Reference, Third Edition
by Rick Aster ............Order No. A58128

Programming Techniques for Object-Based Statistical Analysis with SAS® Software
by Tanya Kolosova and Samuel Berestizhevsky ............Order No. A55869

Quick Results with SAS/GRAPH® Software
by Arthur L. Carpenter and Charles E. Shipp ............Order No. A55127

Quick Start to Data Analysis with SAS®
by Frank C. Dilorio and Kenneth A. Hardy ............Order No. A55550

Reading External Data Files Using SAS®: Examples Handbook
by Michele M. Burlew ............Order No. A58369

Regression and ANOVA: An Integrated Approach Using SAS® Software
by Keith E. Muller and Bethel A. Fetterman ............Order No. A57559

Reporting from the Field: SAS® Software Experts Present Real-World Report-Writing Applications ............Order No. A55135

SAS® Applications Programming: A Gentle Introduction
by Frank C. Dilorio ............Order No. A56193

SAS® for Linear Models, Fourth Edition
by Ramon C. Littell, Walter W. Stroup, and Rudolf J. Freund ............Order No. A56655

SAS® for Monte Carlo Studies: A Guide for Quantitative Researchers
by Xitao Fan, Akos Felsovalyi, Steven A. Sivo, and Sean C. Keenan ............Order No. A57323

SAS® Macro Programming Made Easy
by Michele M. Burlew ............Order No. A56516

SAS® Programming by Example
by Ron Cody and Ray Pass ............Order No. A55126

SAS® Programming for Researchers and Social Scientists, Second Edition
by Paul E. Spector ............Order No. A58784

SAS® Software Roadmaps: Your Guide to Discovering the SAS® System
by Laurie Burch and SherriJoyce King ............Order No. A56195

SAS® Software Solutions: Basic Data Processing
by Thomas Miron ............Order No. A56196

SAS® Survival Techniques for Medical Research, Second Edition
by Alan Cantor ............Order No. A58416

SAS® System for Elementary Statistical Analysis, Second Edition
by Sandra D. Schlotzhauer and Ramon C. Littell ............Order No. A55172

SAS® System for Forecasting Time Series, 1986 Edition
by John C. Brocklebank and David A. Dickey ............Order No. A5612

SAS® System for Mixed Models
by Ramon C. Littell, George A. Milliken, Walter W. Stroup, and Russell D. Wolfinger ............Order No. A55235

SAS® System for Regression, Third Edition
by Rudolf J. Freund and Ramon C. Littell ............Order No. A57313

SAS® System for Statistical Graphics, First Edition
by Michael Friendly ............Order No. A56143

The SAS® Workbook and Solutions Set
(books in this set also sold separately)
by Ron Cody ............Order No. A55594

Selecting Statistical Techniques for Social Science Data: A Guide for SAS® Users
by Frank M. Andrews, Laura Klem, Patrick M. O'Malley, Willard L. Rodgers, Kathleen B. Welch, and Terrence N. Davidson ............Order No. A55854

Solutions for Your GUI Applications Development Using SAS/AF® FRAME Technology
by Don Stanley ............Order No. A55811

Statistical Quality Control Using the SAS® System
by Dennis W. King ............Order No. A55232

A Step-by-Step Approach to Using the SAS® System for Factor Analysis and Structural Equation Modeling
by Larry Hatcher ............Order No. A55129

www.sas.com/pubs


A Step-by-Step Approach to Using the SAS® System for Univariate and Multivariate Statistics
by Larry Hatcher and Edward Stepanski ............Order No. A55072

Strategic Data Warehousing Principles Using SAS® Software
by Peter R. Welbrock ............Order No. A56278

Survival Analysis Using the SAS® System: A Practical Guide
by Paul D. Allison ............Order No. A55233

Table-Driven Strategies for Rapid SAS® Applications Development
by Tanya Kolosova and Samuel Berestizhevsky ............Order No. A55198

Tuning SAS® Applications in the MVS Environment
by Michael A. Raithel ............Order No. A55231

Univariate and Multivariate General Linear Models: Theory and Applications Using SAS® Software
by Neil H. Timm and Tammy A. Mieczkowski ............Order No. A55809

Using SAS® in Financial Research
by Ekkehart Boehmer, John Paul Broussard, and Juha-Pekka Kallunki ............Order No. A57601

Using the SAS® Windowing Environment: A Quick Tutorial
by Larry Hatcher ............Order No. A57201

Visualizing Categorical Data
by Michael Friendly ............Order No. A56571

Working with the SAS® System
by Erik W. Tilanus ............Order No. A55190

Your Guide to Survey Research Using the SAS® System
by Archer Gravely ............Order No. A55688

JMP® Books

Basic Business Statistics: A Casebook
by Dean P. Foster, Robert A. Stine, and Richard P. Waterman ............Order No. A56813

Business Analysis Using Regression: A Casebook
by Dean P. Foster, Robert A. Stine, and Richard P. Waterman ............Order No. A56818

JMP® Start Statistics, Second Edition
by John Sall, Ann Lehman, and Lee Creighton ............Order No. A58166

www.sas.com/pubs


Welcome * Bienvenue * Willkommen * Yohkoso * Bienvenido

SAS Publishing Is Easy to Reach

Visit our Web site located at www.sas.com/pubs

You will find product and service details, including

• companion Web sites

• sample chapters

• tables of contents

• author biographies

• book reviews

Learn about

• regional user-group conferences

• trade-show sites and dates

• authoring opportunities

• e-books

Explore all the services that SAS Publishing has to offer!

Your Listserv Subscription Automatically Brings the News to You

Do you want to be among the first to learn about the latest books and services available from SAS Publishing? Subscribe to our listserv newdocnews-l and, once each month, you will automatically receive a description of the newest books and which environments or operating systems and SAS® release(s) each book addresses.

To subscribe,

1. Send an e-mail message to [email protected].

2. Leave the “Subject” line blank.

3. Use the following text for your message:

subscribe NEWDOCNEWS-L your-first-name your-last-name

For example: subscribe NEWDOCNEWS-L John Doe


You’re Invited to Publish with SAS Institute’s Books by Users Press

If you enjoy writing about SAS software and how to use it, the Books by Users program at SAS Institute offers a variety of publishing options. We are actively recruiting authors to publish books and sample code.

If you find the idea of writing a book by yourself a little intimidating, consider writing with a co-author. Keep in mind that you will receive complete editorial and publishing support, access to our users, technical advice and assistance, and competitive royalties. Please ask us for an author packet at [email protected] or call 919-531-7447. See the Books by Users Web page at www.sas.com/bbu for complete information.

Book Discount Offered at SAS Public Training Courses!

When you attend one of our SAS Public Training Courses at any of our regional Training Centers in the United States, you will receive a 20% discount on book orders that you place during the course. Take advantage of this offer at the next course you attend!

SAS Institute Inc.
SAS Campus Drive
Cary, NC 27513-2414
Fax 919-677-4444

* Note: Customers outside the United States should contact their local SAS office.

E-mail: [email protected]
Web page: www.sas.com/pubs
To order books, call SAS Publishing Sales at 800-727-3228*
For other SAS business, call 919-677-8000*