THE APPLICATION OF AN INTEGRATED SET OF CATEGORICAL ANALYSIS METHODS TO A LARGE ENVIRONMENTAL DATASET
WITH REPEATED MEASURES AND PARTIALLY COMPLETE DATA
by
Maura Ellen Stokes
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 1807T
September 1986
THE APPLICATION OF AN INTEGRATED SET OF CATEGORICAL
ANALYSIS METHODS TO A LARGE ENVIRONMENTAL DATASET
WITH REPEATED MEASURES AND PARTIALLY COMPLETE DATA
by
Maura Ellen Stokes
A Dissertation submitted to the faculty of The University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Public Health in the Department of Biostatistics
Chapel Hill
1986
ABSTRACT
MAURA ELLEN STOKES. The Application of an Integrated Set of Categorical Analysis Methods to a Large Environmental Dataset with Repeated Measures and Partially Complete Data (Under the direction of Gary Koch).
The usefulness of some recently developed categorical
data methodology is evaluated through its application to a
large dataset which pertains to environmental health.
Multivariate randomization test statistics are employed in a
variable selection strategy to evaluate the association
between demographic variables in the dataset and the
response variables pertaining to the prevalences of colds
and asthma in children. Subpopulations are formed on the
basis of the results of the variable selection and weighted
least squares methods are utilized to describe the variation
among the response estimates of interest. Principal
attention is given to subpopulations based on area, race and
sex. Various types of modeling techniques are illustrated,
including the use of residual analysis in assessing a given
model's appropriateness.
Analysis of the partially complete data is undertaken
with two different strategies. Multivariate ratio
estimation involves the calculation of multivariate ratio
estimates of the means of interest and a corresponding
covariance matrix. Supplemental margins is a linear models
strategy in which the complete and incomplete data are
combined by treating the incomplete observations as members
of distinct subpopulations. These methods as well as some
variations are applied to this dataset and their advantages
and disadvantages evaluated.
ACKNOWLEDGEMENTS
I gratefully acknowledge Gary Koch for his continued guidance, support and patience with me (although I still deny losing the auditron). I would also like to thank the members of my committee, Craig Turnbull, Keith Muller, Carl Shy and Kerry Lee for their efforts.
I would like to express my appreciation to my parents for making the idea of graduate school possible.
I would like to thank my brother Matt for his help in editing this dissertation.
I would like to express my appreciation to my friend, Lisa Lavange, for establishing that one can finish a dissertation expediently, as well as get married, have a child, and establish a career so that the pressure was off me.
I am grateful for the support of friends, co-workers, and the often unfortunate office mates during the many colorful phases of this entire process. I especially acknowledge the help it was to have company during the late night activities at the trailer, RTI, and SAS.
And lastly, here's to Kate Lavange for not having acquired the vocabulary yet to ask when it would be finished.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ................................................ iii

I.   REVIEW OF STUDIES CONCERNING AIR POLLUTION AND
     HEALTH AND DISCUSSION OF CHESS PULMONARY DATA .............. 1

     1.1 Overview of Air Pollution Studies ...................... 1
     1.2 CHESS Studies .......................................... 8
     1.3 Children's Pulmonary Function Data ..................... 9
     1.4 Pulmonary Function Data Analysis ....................... 11
     1.5 Data Structures for Categorical Data Analysis .......... 13
     1.6 Data Description ....................................... 16
     1.7 Overview of Research ................................... 19

II.  RANDOMIZATION TESTS METHODOLOGY ............................ 38

     2.1 Introduction ........................................... 38
     2.2 Research Design Implications ........................... 39
     2.3 Randomization Test Methods ............................. 41
         2.3.1 First Order Association .......................... 41
         2.3.2 Partial Association .............................. 45
         2.3.3 Average Partial Association Methodology .......... 48
         2.3.4 Mean Score Test .................................. 52
         2.3.5 Correlation Test ................................. 56
     2.4 Multivariate Randomization Statistics .................. 58
     2.5 Summary ................................................ 61

III. WEIGHTED LEAST SQUARES METHODOLOGY ......................... 63

     3.1 Introduction ........................................... 63
     3.2 Weighted Least Squares Methodology ..................... 65
         3.2.1 Overview ......................................... 65
         3.2.2 Statistical Theory for Weighted Least
               Squares .......................................... 67
         3.2.3 An Example of a Strictly Linear Model ............ 74
         3.2.4 Case Record Data ................................. 79
     3.3 Overview of Repeated Measurement Analyses .............. 83
     3.4 Repeated Measurements Analysis for Categorical
         Data ................................................... 89
     3.5 Missing Data Strategies for Categorical Data
         Analysis ............................................... 93
         3.5.1 Ratio Estimation ................................. 94
         3.5.2 Supplemental Margins ............................. 97
     3.6 Summary ................................................ 104

IV.  ANALYSIS OF COMPLETE DATA: VARIABLE SELECTION .............. 106

     4.1 Introduction ........................................... 106
     4.2 Variable Selection for Categorical Data ................ 107
     4.3 Variable Selection Extended to Multivariate
         Response Profiles ...................................... 110
     4.4 Application of Multivariate Variable Selection
         to CHESS Data for 2+ Colds and 1+ Asthma in
         1973, 1974, and 1975 ................................... 112

V.   LINEAR MODELS ANALYSIS OF COMPLETE DATA .................... 122

     5.1 Linear Models Analysis of 1+ Asthma Data ............... 122
     5.2 Linear Models Analysis of 2+ Colds for 1973,
         1974, and 1975 for the Sex x Race x Area
         Cross-Classification ................................... 135
     5.3 Linear Models Analysis for 1+ Asthma for the
         Area x Sex Cross-Classification for 1973,
         1974, and 1975 Combined ................................ 155
     5.4 Linear Models Analysis of Mean Colds for 1973,
         1974, and 1975 ......................................... 161

VI.  ANALYSIS OF INCOMPLETE DATA ................................ 174

     6.1 Introduction ........................................... 174
     6.2 Univariate Analysis of 1+ Asthma in 1973, 1974
         and 1975 ............................................... 175
     6.3 Supplemental Margins in the Analysis of the
         Proportions of Colds Reported in 1973, 1974,
         and 1975 ............................................... 185
         6.3.2 Supplemental Margins as a Means of
               Extending the Complete Sample Size ............... 202
     6.4 Multivariate Ratio Estimation for the Analysis
         of Mean [...] the Six Point Data ....................... 211
     6.5 Analysis of the Proportion of Those Reporting
         Asthma in 1973, 1974 and 1975 with
         Multivariate Ratio Estimation .......................... 232

VII. DISCUSSION ................................................. 244

BIBLIOGRAPHY .................................................... 248
CHAPTER I
REVIEW OF STUDIES CONCERNING AIR POLLUTION AND HEALTH
AND DISCUSSION OF CHESS PULMONARY FUNCTION DATA
1.1 Overview of Air Pollution Studies
Air pollution has been a major concern of
industrialized countries for many decades. Besides being
credited with damaging the environment and the ecosystem,
air pollution has also been blamed for adversely affecting
human health. Such concern was part of the motivation for
the establishment of the U.S. Environmental Protection
Agency as well as the passage of the Clean Air Act to set
permissible concentrations of [...] pollutants. In the
past fifteen years, over [...] studies have been conducted
to estimate health costs incurred due to air pollution. The
resulting figures range from a few hundred million to ten
billion dollars per year (Herman 1977).
However, scientific documentation of the harmful
consequences of air pollution has been difficult to obtain.
In addition, the exact mechanisms by which it might cause
damage are still being investigated. Historical events
which seemed to create a fairly strong case against air
pollution were acute episodes in which weather-induced air
stagnation served to increase markedly the concentrations of
pollutants. These were primarily sulfur oxide and
particulate complex resulting from the burning of coal.
Excessive mortality due to incidents of this nature occurred
in Meuse Valley, Belgium in 1930, Donora, Pennsylvania in
1948 and New York City in 1953 (Shy et al. 1978). However,
the most severe example is the London fog of December, 1952,
in which up to four thousand deaths were attributed to the
unusual concentrations of air pollutants. These were mainly
the result of bronchitis, pneumonia, and other respiratory
and cardiac diseases. These episodes served to focus
attention on the potentially harmful consequences of air
pollution. Soon, some types of controls were established in
several countries. Also, systematic investigations into the
cause and effect of air pollution were initiated.
Besides the sulfur oxide and particulate complex,
there are two other major types of air pollution which are
recognized: photochemical oxidants, and miscellaneous pollutants,
which are produced at point sources such as smelters, mines,
and factories. The sulfur oxide/particulate complex is
believed to have the deleterious health effect of increasing
the risk of acute and chronic respiratory disease and
aggravating chronic lung disease. The photochemical
oxidants can cause eye irritation and respiratory symptoms,
including coughing and choking. Exposures to asbestos can
lead to malignancies of the lung, while high levels of
mercury can result in central nervous system trouble and
renal toxicity.
Various studies have been conducted to investigate the
connection between air pollution and daily mortality (Martin
1964; Buechley and Riggan 1973). One example is a study led
by McCarroll and Bradley (1966) which examined daily deaths
in New York City from 1896-1965. They found episodes of
unusually high mortality which corresponded to days of high
pollution. Low wind speed and temperature inversion were
also recorded for those periods. They studied five of these
episodes closely and found that a rise in mortality occurred
on the peak days of the pollution and also that all ages
were affected.
Most of these types of studies found greater mortality
on those days with higher-than-usual pollution. However,
confounding factors in this relationship include weather
conditions, season, and flu epidemics. In his study in New
York City from 1962-1966, Buechley (1973) attempted to
control for such extraneous factors. He looked at daily
reported deaths from all causes, and adjusted their counts
for seasonal cycles, temperature extremes, holidays,
weekdays, and influenza epidemics. SO2 was still found to
be correlated with daily mortality after these adjustments,
as were the particulate pollutants.
Researchers have also directed attention to the
relationship between air pollution and morbidity. Those
studies examining a potential association between chronic
respiratory disease and air pollution also noted problems
with potentially confounding factors such as smoking, health
conditions, and exposures at the workplace. Many studies
have been undertaken, including some in the United States,
Britain, and Canada, and most have indicated a positive
association between chronic respiratory symptoms and
pollution -- specifically sulfur oxide and particulates
(Lambert and Reid 1970; Bates 1967; Chapman et al. 1973).
Lambert and Reid did a mail survey of 9975 persons in Great
Britain and found that prevalence rates for symptoms
increased with increasing air pollution, and that cigarette
smokers had higher rates than non-smokers. However, there
are so many potential confounders that it's difficult to
assess the effect air pollution had by itself. Occupational
exposures, socioeconomic factors, selective migration, and
smoking behavior tend to cloud the issue and the progressive
nature of chronic respiratory disease makes it difficult to
evaluate the effect of air pollution on the disease.
The incidence of acute respiratory diseases has also
been observed to be higher in areas with high levels of
pollution. Dohan and Taylor (1960) studied female RCA
employees in several U.S. cities during 1957-1960. They
looked at respiratory illness lasting seven or more days,
and found it correlated with sulfation rates. This study
did not adjust for season, which can greatly affect the
occurrence of respiratory disease. Later studies did adjust
for season and temperature, as well as for social class.
White collar workers at a New York City insurance company in
1965-1967 were found to have higher daily respiratory
disease absences during periods of higher SO2 and
particulate concentrations, even after adjusting for season
(Verma et al. 1969). Levy (1977) found that hospital
admissions for respiratory disease in Hamilton, Ontario were
increased on days of heavy pollution, also adjusting for
season.
Another respiratory disease that has been studied with
respect to air pollution is bronchial asthma. Schrenk et
al. (1949) found that eighty-eight percent of asthmatics
living in Donora, Pennsylvania during the 1948 episode
reported having symptoms during that period. Some studies
have focused on the incidence of asthmatic attacks during
periods of acute air pollution, when presumably asthmatics
would become one of the more susceptible groups. Emergency
room visits for asthma increased at [...] of seven New York
City hospitals studied during a [...] air pollution episode
(Glasser et al. 1976). Some studies did not find an
association between emergency room visits and air pollution
levels (Rao et al. 1973; Sultz et al. 1970). Still other
studies have found a link between asthmatic attacks and
level of pollution in the community (Yoshida 1976).
The effect of air pollution on the occurrence of
respiratory illness in children has also been the subject of
investigation. Their high susceptibility to respiratory
illness would appear to make them choice candidates for
early victims of environmentally-induced respiratory
problems. Also, with this study population, the role of
smoking behavior in the overall picture is no longer a
factor to be controlled, at least for very young children.
As the authors of "Health Effects of Air Pollution" state:
"... an impressive number of studies has consistently demonstrated an association between acute respiratory illness rates in children, particularly illnesses of the lower respiratory tract, with residence in more polluted communities affected by the sulfur oxide/particulate category." (Shy et al. 1978)
One major study was conducted in Great Britain,
beginning in 1946. More than 3000 children born in Great
Britain during the first week of March of that year were
followed and evaluated periodically as part of a
longitudinal health survey effort. Subjects were grouped
into one of four pollution categories, mostly depending on
coal consumption in the region. A combination of mothers'
reports, doctors' records and health examinations were used
to gain information on the child's respiratory history.
There was a definite association between lower respiratory
symptoms and pollution category; in fact, there was a
gradient in level of reported disease from the lowest
pollution area to the highest. There was not a similar
association for upper respiratory disease symptoms (Douglas
and Waller 1966).
Another study conducted in England in 1964 concerned a
group of 819 five year olds in one of four areas of
Sheffield selected for their varying degrees of air
pollution. This was based on smoke/sulfur dioxide
gradients. Researchers found that there was an association
between air pollution level and incidence of both upper
respiratory symptoms and lower respiratory symptoms. Forced
expiratory volume (FEV) and forced vital capacity (FVC) were
measured as well, but were not significantly different from
one area to another (Lunn et al. 1967).
Several other investigators have used schools as a
base for their study of children. Toyama examined peak flow
rates for schoolchildren living in Osaka and Kawasaki,
Japan, both heavily polluted from industrial sources. He
found that those children living in the more polluted areas
had lower peak flows than those living in the less polluted
areas. Ferris studied pulmonary function as well as school
absences in first and second graders in Berlin, N.H., from
January 1966 to June 1967. Pollution from paper mills is a
major environmental problem here. While school absences
were not significantly different for those in the more
polluted neighborhoods, the peak flow rates were lowest for
those from the more heavily polluted areas (Ferris 1970).
Shy et al. (1973) found some differences in ventilatory
function for certain race/age groups of children living in
the more polluted areas of Cincinnati, Chattanooga, and New
York City. However, the differences were usually fairly
small. Researchers at Akron, Ohio monitored air pollution
levels at two elementary schools and conducted pulmonary
function tests once children were discovered to be
symptomatic of acute respiratory disease. They concluded
that those children at the school with higher SO2 and NO2
pollution had higher incidences of disease; also, their
pulmonary function was further decreased from that of the
other school.
1.2 CHESS Studies
In 1967 EPA organized what was to be a multi-million
dollar effort to assess air pollution's effect on health in
the United States called the Community Health and
Environmental Surveillance System (CHESS). The system
actually encompassed many different studies, some
retrospective and some prospective. The common thread was
to involve communities or areas with exposure gradients for
a particular pollutant. Pollution associated with sulfur
oxides, particulates and oxidants was under investigation.
In addition, factors such as age, race, sex, and socio-
economic status were to ,?e._controlled, either through the
use of homogeneous communities, or as part of a later
statistical analysis. Such populations as the elderly,
asthmatic, and children were to be investigated, and health
indicators such as disease occurrence and pulmonary function
used. In 1972, over 250,000 people were involved in CHESS
studies (Report to the U.S. House of Representatives
Committee on Science and Technology 1980).
However, when the first set of CHESS data was
collected and published in 1974, the program became very
controversial. There were problems with data quality,
particularly the air monitoring data, and also concern with
some of the health questionnaires used. A Congressional
investigation ensued and concluded that the data published
could not be used to support its estimates for the specific
levels of pollution which were associated with serious
health effects. As a result, other CHESS datasets were
subjected to intense data validation efforts which were not
completed until 1978. In addition, the Congressional report
detailed the limitations of certain aspects of the CHESS
data.
1.3 Children's Pulmonary Function Data
One of the CHESS studies was directed at assessing the
relationship between air pollution and pulmonary function in
children. From 1972-1975, pulmonary function, as measured
by forced expiratory volume at .75 seconds (FEV), was
evaluated in the fall, winter, and spring for over 20,000
elementary school children. Areas were selected according
to a criterion of expected pollution gradients, and included
Charlotte, N.C., Birmingham, Alabama, New York, the Salt
Lake Basin in Utah, and two separate areas in the Los
Angeles Basin in California. Within these areas,
communities, referred to as sectors, were chosen for study
for being similar to each other in terms of demographics but
varying with respect to degree of pollution exposure. Each
area included from two to six of these sectors, which were
basically white and middle-class. The study included
children who were in the second, third, or fourth grade in
Fall, 1972. Measurements were taken nine times; these were
Fall 1972, Winter 1973, Spring 1973, Fall 1973, Winter 1974,
Spring 1974, Fall 1974, Winter 1975, and Spring 1975.
Besides FEV.75, other information gathered for each subject
included school, grade, birth date, race, sex, height, and a
self-reported respiratory symptom which indicated whether or
not the subject had a cold and/or asthma at the time.
Pollutants that were monitored included total suspended
particulate matter, suspended sulfates, and sulfur dioxide.
In California, ozone levels were also monitored. Quarterly
geometric means were calculated from daily measurements of
the monitoring stations. On the whole, these were situated
such that most of the subjects' residences were within two
miles of the stations, which were sometimes located at the
schools themselves. There was a great deal of trouble with
the aerometric data. Procedural errors resulted in a
negative bias in the total sulfate particulate levels of
from ten to thirty percent. Similar problems with suspended
sulfate measurements left the first two years of these data
with as much as a 50% negative bias. Other methodological
and shipping errors led to data quality problems with sulfur
dioxide as well.
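The quarterly geometric means mentioned above can be illustrated with a minimal sketch. It assumes strictly positive daily readings from a single station; the function name and the sample readings are hypothetical and are not CHESS values.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive readings, via the mean of logs."""
    if not values:
        raise ValueError("need at least one reading")
    if any(v <= 0 for v in values):
        raise ValueError("readings must be strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical daily readings for one station in one quarter (not CHESS data).
daily_readings = [40.0, 90.0, 160.0, 250.0]
quarterly_mean = geometric_mean(daily_readings)
```

A geometric mean is pulled less strongly toward occasional very high daily readings than an arithmetic mean would be.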
Researchers at the University of North Carolina/Chapel
Hill, under the direction of Carl Shy, entered into a
contract with EPA to clean and edit this particular dataset
and then perform a statistical analysis. Various stages of
data processing were required to produce a database of
sufficient quality to analyze. The EPA raw data consisted
of one record per subject for a year's data - i.e. fall,
winter, and spring measurements. A total of 60,836 such
records were available. There were many problems with these
data, including missing records due to absences or
migration, invalid date values, and difficulties in matching
records from one year to another. Data editing
was performed by Keith Muller and Joanna Smith, the end
result of which was an analysis file which contained
complete records, i.e. nine time points, for 3,666 subjects.
This involved matching 18,714 valid first year records,
20,980 second year records, and 21,142 third year records by
area, sector, school, and name. If perfect name matches
were not made, names were transformed to a Soundex-like
representation and matched to the third year records (Muller
et al. 1981). Sex, race, birth month and birth year were
required to match across all three years, with the possible
exception of one mismatch out of the twelve possible. Those
records missing values for FEV were deleted, as well as
those corresponding to subjects reporting asthma at any of
the nine time points.
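The demographic agreement rule can be sketched as follows. This is one possible reading of "one mismatch out of the twelve possible" (four variables in each of three years), in which a mismatch is a value that disagrees with the most common value of that variable. The field names are hypothetical, and the Soundex-like name matching is not reproduced here.

```python
from collections import Counter

# Hypothetical field names for the four demographic variables.
FIELDS = ("sex", "race", "birth_month", "birth_year")


def count_mismatches(yearly_records):
    """Count values that disagree with the most common value of their field."""
    mismatches = 0
    for field in FIELDS:
        values = [record[field] for record in yearly_records]
        most_common_count = Counter(values).most_common(1)[0][1]
        mismatches += len(values) - most_common_count
    return mismatches


def demographics_agree(yearly_records, max_mismatches=1):
    """Accept a linked set of yearly records with at most one disagreement."""
    return count_mismatches(yearly_records) <= max_mismatches


# Hypothetical subject whose birth month was miskeyed in the second year.
records = [
    {"sex": "F", "race": "White", "birth_month": 3, "birth_year": 1964},
    {"sex": "F", "race": "White", "birth_month": 8, "birth_year": 1964},
    {"sex": "F", "race": "White", "birth_month": 3, "birth_year": 1964},
]
```

Under this reading the example subject is retained, while a second disagreement on any variable would cause the link to be rejected.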
1.4 Pulmonary Function Data Analysis
The focus of the original analysis of these data was
to assess whether there was a relationship between pulmonary
function and air pollution in children. A multivariate
analysis of variance was used in order to investigate this
relationship. The dependent variables modeled were the nine
FEV.75 measurements for fall, winter, and spring for all
three years. The variable used to account for air pollution
exposure was an indicator variable for sector of residence.
This was considered appropriate since the sectors were
chosen according to an expected pollution gradient implied
by historical data. The pollution data collected showed an
observed gradient that, on the whole, came close to the
expected gradient (Hasselblad et al. 1974). The data was
split into 10% and 90% samples via a stratified random
procedure. Regressions were performed on a set of
demographic variables in the 10% sample to determine
relevant covariates. Those chosen were race, sex, height,
height squared, age, and age squared (Muller et al. 1981).
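A stratified random split of this kind might look like the sketch below. The stratification variable used in the original analysis is not stated here, so stratifying by area is an assumption, as are the function name, the seed, and the synthetic subjects.

```python
import random
from collections import defaultdict


def stratified_split(records, stratum_of, fraction=0.10, seed=0):
    """Split records into a small and a large sample within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    small, large = [], []
    for key in sorted(strata):
        members = strata[key][:]
        rng.shuffle(members)
        cut = round(len(members) * fraction)
        small.extend(members[:cut])
        large.extend(members[cut:])
    return small, large


# Hypothetical subjects: 200 in each of three areas (not CHESS data).
subjects = [{"id": i, "area": area} for i, area in enumerate(["A", "B", "C"] * 200)]
exploratory, confirmatory = stratified_split(subjects, lambda s: s["area"])
```

Selecting covariates on the small sample and fitting the final models on the large one keeps the exploratory step from influencing the tests carried out on the main sample.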
Seven analyses were performed altogether. Six were
area specific, in which case the important predictor was
sector, and the other design factors were year and season.
The seventh analysis included all the areas, and area
replaced sector as a factor in its design. The basic
findings of the all-areas analysis involved interactions
with area. For the within Birmingham, within Utah, and
within New York comparisons, significant relationships were
found between FEV.75 and sector. In Charlotte and
California II, no significant relationship was found, but in
these areas the expected pollution gradient was not observed
either. For California I, one did find the expected
pollution gradient but no relationship between sector and
pulmonary function was evident. The authors concluded that
these analyses, taken together, supported a relationship
between pollution and patterns of pulmonary function in
children (Muller et al. 1981).
1.5 Data Structures for Categorical Data Analysis
The motivation for this dissertation is the analysis
of the categorical response measures in the CHESS dataset.
As mentioned above, a self-reported measure is included
which indicates whether or not a subject had a cold and/or
asthma at the time of the pulmonary function test. In
addition, this dissertation will deal with all the data
vectors collected, both complete and incomplete. Thus,
additional data management and the construction of new data
structures were required. The data profiles fall into one
of seven groups -- those corresponding to subjects with data
for each of the years 1973, 1974, and 1975, and six other
groups with data for various combinations of those years as
illustrated below:
1. 1973, 1974, and 1975
2. 1973 and 1974
3. 1973 and 1975
4. 1974 and 1975
5. 1973 only
6. 1974 only
7. 1975 only
For each year represented in one of these groups, there is
data for three time points corresponding to a fall, winter,
and spring measurement. Thus, there are three basic types
of profiles represented in this dataset. The first is the
complete data profile, consisting of nine time points per
observation. The second consists of six data points and
three missing, of which there are three kinds corresponding
to either 1973, 1974, or 1975 being missing. The third
profile is that which consists of only three data points,
with six missing. These arise when there is data for one
year only, so there are also three ways of producing this
particular profile.
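The seven profile groups and their three basic types can be enumerated directly. The following is a minimal sketch; the dictionary keys are illustrative names rather than variables from the CHESS files.

```python
from itertools import combinations

YEARS = (1973, 1974, 1975)
SEASONS_PER_YEAR = 3  # fall, winter, and spring measurements


def enumerate_profiles():
    """List every non-empty combination of years with its time-point counts."""
    profiles = []
    for size in range(len(YEARS), 0, -1):
        for years_present in combinations(YEARS, size):
            profiles.append({
                "years": years_present,
                "observed_points": SEASONS_PER_YEAR * size,
                "missing_points": SEASONS_PER_YEAR * (len(YEARS) - size),
            })
    return profiles


profiles = enumerate_profiles()
```

The enumeration returns the groups in the same order as the list above: one profile with nine observed points, three with six, and three with three.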
The data management required to produce a dataset
containing these profiles consisted of going back to an
intermediate stage in Keith Muller's data management in
which records without complete data were left behind, and
merging them with a file which included the 'complete data'.
Edits done included deleting those records which were
missing key demographic information. For both the doubles
(those having data from two years) and singles (data from
only one year), observations were deleted if sex, race,
birthmonth or birthyear were missing at any of the six or
three time points. If there was a disagreement on a
demographic variable value for a doubles observation, it was
deleted. In order to keep the data consistent with that in
the original analysis file, records missing FEV.75 were also
eliminated. It was not felt that any data of consequence
would be lost due to this last action, since FEV.75 was the
focal point of a measurement period, and, if missing, tends
to throw into doubt the validity of other data registered
for that period.
The above editing process led to a dataset consisting
of 20,392 records, distributed as follows:
Data Profile
1973   1974   1975   Number of Observations
yes    no     no       4165
no     yes    no       3426
no     no     yes      4979
yes    yes    no       1208
yes    no     yes       434
no     yes    yes      2131
yes    yes    yes      4049
The most striking impression one gets from this table is the
five-fold increase in the number of observations from
approximately four thousand to twenty thousand when the
incomplete data is included. The relatively low number of
observations for the 1973 and 1975 only profile is
understandable. One potential explanation is migration, with
the odds of a family returning a year later seemingly
limited.
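The arithmetic behind the table can be checked with a short sketch. The seven counts are taken from the table above; the tuple keys (1973, 1974, 1975 present or absent) are an illustrative encoding.

```python
# Profile counts keyed by (has_1973, has_1974, has_1975).
PROFILE_COUNTS = {
    (True, False, False): 4165,
    (False, True, False): 3426,
    (False, False, True): 4979,
    (True, True, False): 1208,
    (True, False, True): 434,
    (False, True, True): 2131,
    (True, True, True): 4049,
}

total_records = sum(PROFILE_COUNTS.values())
three_year_records = PROFILE_COUNTS[(True, True, True)]
fold_increase = total_records / three_year_records
```

The counts sum to the 20,392 records stated above, and the ratio of all records to three-year records is slightly above five.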
Two main types of data structures were created for
initiating statistical analysis. The first consisted of
data for individual years, i.e. all those observations which
had data for 1973 went into the 1973 dataset, regardless of
which profile they represented. Similarly, datasets were
constructed for 1974 only and 1975 only. 9806 records were
included in the 1973 dataset, 10762 records in the 1974
dataset, and 11560 records in the 1975 dataset. The
categorical response variables had to be available for all
three time points in a year in order that an observation
qualify for a particular 'year' dataset (136 observations
did not meet this criterion and were excluded). It should
be noted that an individual could be represented in one,
two, or three of the 'year' datasets. The other data
structure of primary interest is that of the complete data,
those which had categorical data for all nine time points.
There were 4002 of these observations. The reason that this
complete dataset contains more data than the analysis
dataset discussed in section 1.4 is that the latter excluded
asthmatics. Also, it should be noted that the 4049 records
listed in the profile figures on the preceding page had
data for each year, but not necessarily for each of the
three time points within each year. So, 47 records were
classified as having a three-year profile from the point of
view of the individual year datasets, but were not
acceptable into the 'complete' dataset.
1.6 Data Description
Tables 1.1-1.4 contain the demographic
crosstabulations for the 1973, 1974, 1975, and Complete
datasets respectively. Total numbers of White, Black,
Hispanic, and Other races are displayed within sex and
geographic area. Females have a slight advantage over males
in the Complete data at 51%, while this is reversed for the
individual datasets, as the percent of males is
approximately 51% for them. Birmingham is the area with the
most subjects and New York City has the fewest. The
Complete data is 79% White, 15% Black and 3% Hispanic with
the remaining 1% classified as Other. Race percentages for
the individual year datasets follow the same pattern
closely, with White ranging from 75%-79%, Black ranging from
15%-19%, Hispanic ranging from 3%-4% and Other at 1%-2%. It
should be noted that several cells of this area by sex by
race demographic crosstabulation were not represented in the
Complete dataset.
Tables 1.5-1.8 contain information related to the
classification of each area by a pollution index. The index
was created from the aerometric data collected as part of
the study and displayed in Muller et al. (1981). Average
sector ranks were calculated for Total Suspended
Particulates for 1972-1975, Total Suspended Sulfates for
1972-1975, and Sulfur Dioxide for 1972-1975. Scores of one,
two, and three, were assigned to the sectors for each of the
three types of measures, where '1' was assigned to the
highest one-third rankings, '2' was assigned to the middle
third rankings, and '3' was assigned to the lowest third.
These scores were then added, resulting in total scores
ranging from 3 to 9. The following pollution index was then
created: 1--7,8,9, 2--5,6, and 3--3,4. Thus, '1'
corresponds to lower pollution, '2' corresponds to medium
level pollution, and '3' corresponds to higher pollution
levels. Of course, these labels are relative to the CHESS
data, but they do provide a pollution gradient. There were
nine sectors classified as l~w pollution, mostly in
Birmingham. Six sectors were assigned to the medium
pollution group, and the remaining seven sectors were
classified as high pollution. Tables 1.5-1.8 display the
number of subjects which comprise the cells of an area by
sex by pollution index crossclassification for the 1973,
1974, 1975, and Complete data. Note that there will be
cells with no elements since not every area will have
sectors representing each of the pollution levels.
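The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the stated mapping from three per-pollutant scores to the three-level index; the function name, the sector labels, and the example scores are hypothetical and are not taken from the CHESS data.

```python
def pollution_index(tsp_score, sulfate_score, so2_score):
    """Map three per-pollutant scores (each 1, 2, or 3) to the 3-level index.

    Totals of 7-9 give index 1 (lower pollution), 5-6 give index 2
    (medium pollution), and 3-4 give index 3 (higher pollution).
    """
    scores = (tsp_score, sulfate_score, so2_score)
    if any(s not in (1, 2, 3) for s in scores):
        raise ValueError("each score must be 1, 2, or 3")
    total = sum(scores)  # ranges from 3 to 9
    if total >= 7:
        return 1
    if total >= 5:
        return 2
    return 3


# Hypothetical sectors; the labels and scores are illustrative only.
example_sectors = {
    "sector_a": (3, 3, 2),  # total 8
    "sector_b": (2, 2, 1),  # total 5
    "sector_c": (1, 1, 1),  # total 3
}
example_index = {name: pollution_index(*s) for name, s in example_sectors.items()}
```

The sketch starts from the scores rather than from the average sector ranks, since the ranking direction is not restated here.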
The outcome variables of interest depend on the data
structures which are being analyzed. For the 1973, 1974,
and 1975 datasets, a variable was created which indicated
the number of times a subject reported having a cold that
year. Thus, the possible values are 0,1,2, and 3 since there
were three measurement periods. A similar variable was
created to indicate the number of times asthma was reported
during the course of the year. Tables 1.9-1.11 display the
mean number of colds reported by area, sex, and race for
1973, 1974, and 1975. Charlotte has the highest mean colds
in 1973 and 1974 with mean colds of .94 and .93, while Utah
takes the lead in 1975 with a mean number of colds of .94.
NYC has the fewest colds per year for all three years. Its
lowest value is .62 colds per year for 1974. The only sort
of trend one might discern by looking at these tables is
that females consistently report more colds than males.
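As a minimal sketch of how the yearly outcome variable and the tabulated means could be formed, assume each subject has three 0/1 cold indicators for a year. The record layout, the function names, and the example records below are assumptions made for illustration, not the study's actual file structure.

```python
from collections import defaultdict


def colds_in_year(period_reports):
    """Count the periods (out of three) in which a cold was reported."""
    if len(period_reports) != 3:
        raise ValueError("expected three measurement periods per year")
    return sum(1 for reported in period_reports if reported)


def mean_colds_by_group(records):
    """Average the yearly cold count within each group.

    records: iterable of (group_label, (p1, p2, p3)) pairs, where each
    p is 1 if a cold was reported in that period and 0 otherwise.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for group, reports in records:
        totals[group] += colds_in_year(reports)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical records, grouped by sex only.
example_records = [
    ("F", (1, 0, 1)),
    ("F", (0, 0, 1)),
    ("M", (0, 0, 0)),
    ("M", (1, 0, 0)),
]
example_means = mean_colds_by_group(example_records)
```

Grouping by an (area, sex, race) tuple instead of a single label would give cell means of the kind shown in Tables 1.9-1.11.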
Table 1.12 is concerned with the proportion of children
reporting colds during each of the measurement periods for
the Complete data. Proportions are listed for males and
females within each of the six areas. Charlotte has the
highest proportions of colds in general, and females have
more colds than males at each of the nine time points.
Tables 1.13-1.15 display the mean number of colds
reported by area, sex, and pollution index for 1973, 1974,
and 1975 data. The mean is lowest for the higher pollution
category for all three years, which is interesting. The
values fall within a narrow range, from a low of .73
for higher pollution in 1973 to a high of .83 for lower
pollution in 1974. Again, females consistently have a
higher number of colds than males. Finally, Table 1.16
displays the proportion of colds at each measurement period
by sex and pollution index for the Complete data. Females
are still reporting more colds than males within each of the
pollution categories.
1.7 Overview of Research
The overall purpose of this dissertation will be to
evaluate the usefulness of some recently developed
categorical data methodology in the analysis of a very large
dataset pertaining to an environmental health problem. One
direction of analysis is the investigation of the
association between categorical health status measures and
extent of air pollution, in order to complement the work
that has already been accomplished for the continuous
measures of the data. Health status is measured by
variables indicating whether or not there is a cold or
asthma recorded at one of nine time points in the study, and
pollution level is indicated by a three-point scale
calculated from the aerometric data collected as part of the
study. One type of the methodology under study will be
randomization techniques which under minimal assumptions
allow one to assess the strength of the relationship of a
response measure and evaluation measure while controlling
for the effects of confounding variables. Other methods to
be employed will be extensions of the weighted least squares
regression methodology outlined in Grizzle, Starmer, and Koch
(1969) which are appropriate for the repeated measures and
incomplete data aspects of the study. Some of these
strategies have only been applied to simplified illustrative
examples, so it is of interest to assess their
appropriateness when applied to a large health dataset.
The first stage of analysis will concentrate on
assessing the relationship between the response variable
(e.g. number of colds) and the evaluation variable (e.g.
pollution) through the use of first order association
statistics. Mantel-Haenszel strategies will be employed to
investigate whether these first order associations (if they
exist) are maintained after adjusting for potentially
confounding factors such as geographic area, sex, race, and
age. The second stage of this part of the analysis will be
to use weighted least squares techniques to model the
relationship of health status and pollution level across the
demographic configurations which display variation. This
relationship will also be modeled across time for those
subjects with complete data.
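To make the idea of a stratified association test concrete, the following Python sketch computes the basic one-degree-of-freedom Mantel-Haenszel statistic for a set of 2x2 tables, one per stratum (for example, one per area by sex combination). It is the simplest special case only: the ordinal responses and three-level pollution index considered here would call for the extended (mean score or correlation) forms of the statistic. The function name and the input layout are assumptions, and no continuity correction is applied.

```python
import math


def mantel_haenszel_chi2(strata):
    """Mantel-Haenszel chi-square (1 df) across stratified 2x2 tables.

    strata: list of tables ((a, b), (c, d)), one per stratum, where rows
    are exposure levels and columns are response levels.
    Returns (Q, p_value), computed without a continuity correction.
    """
    observed = expected = variance = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two subjects adds nothing
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n  # hypergeometric mean of cell a
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("no informative strata")
    q = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square with 1 df: P(X > q) = erfc(sqrt(q / 2)).
    p_value = math.erfc(math.sqrt(q / 2.0))
    return q, p_value
```

Because the expected value and variance are summed over strata before the statistic is formed, small strata contribute without each needing a large sample of its own.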
Another objective of this dissertation will be to
address the analysis of the partially complete data vectors
in this dataset. One strategy of interest is that of
multivariate ratio analysis (Stanish, Koch, and Landis 1978).
This involves the calculation of multivariate ratio
estimates of the means and a corresponding covariance matrix
estimate. For the data under study, this might consist of
mean colds per year. The variation among these means could
thus be analysed using asymptotic regression methodology.
Another method of interest is that of supplemental margins
(Koch, Imrey, and Reinfurt 1972). This is a linear models
technique in which the complete and incomplete data are
combined by considering the incomplete observations as
members of distinct subpopulations. These subpopulations
then contribute whatever information they contain to the
marginal proportions that are subsequently formed from the
data as a whole and analysed.
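The univariate building block of a ratio estimate can be sketched as follows. This is not the multivariate procedure of Stanish, Koch, and Landis itself; it is a generic ratio-of-totals estimator with the usual first-order (Taylor series) variance approximation, written under the assumption that each subject contributes a count of colds reported and a count of periods actually observed. The function and argument names are hypothetical.

```python
def ratio_estimate(colds, periods):
    """Ratio of total colds to total observed periods, with a linearized variance.

    colds:   number of colds reported by each subject
    periods: number of measurement periods observed for each subject
    Returns (r, var_r), where r = sum(colds) / sum(periods) and var_r is
    the first-order Taylor series approximation to its variance.
    """
    n = len(colds)
    if n != len(periods) or n < 2:
        raise ValueError("need two or more subjects with matching inputs")
    sum_y = sum(colds)
    sum_x = sum(periods)
    if sum_x == 0:
        raise ValueError("no observed periods")
    r = sum_y / sum_x
    # Residuals of each subject's colds about the fitted ratio.
    resid_ss = sum((y - r * x) ** 2 for y, x in zip(colds, periods))
    var_r = n * resid_ss / ((n - 1) * sum_x ** 2)
    return r, var_r
```

Multiplying r by three would express the estimate as colds per year, and subjects with one or two missing periods still contribute through their observed periods.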
In conclusion, this dissertation will be concerned
with several aspects of the analysis of a large longitudinal
dataset. One objective will be to determine if the
integrated set of relatively new categorical analysis
procedures is an appropriate resource with which to answer
the substantive questions about the relationship between health
status and air pollution that the study sought to address.
Another aspect will be the application of categorical
analysis strategies to partially complete data and the
evaluation of their usefulness, particularly in the context
of an especially large dataset. Finally, the extent to
which modifications to such procedures would make them more
helpful will also be evaluated.
TABLE 1.1
AREA BY RACE BY SEX FOR 1973 DATA
(cell entries: N)

             -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
AREA           WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
CHARLOTTE        442      202        2        2      462      209     NONE        1      648      672     1320
BIRMINGHAM      1060      488        2        3     1003      445        1        4     1553     1453     3006
NYC              238       55        2        2      230       62        8        4      297      304      601
UTAH             727        4       52       18      711        2       79       19      801      811     1612
CALI I           855       20       76       39      787       20       77       33      990      917     1907
CALI II          673        1       40       16      565        3       50       12      730      630     1360
TOTAL           3995      770      174       80     3758      741      215       73     5019     4787     9806

TABLE 1.2
AREA BY RACE BY SEX FOR 1974 DATA
(cell entries: N)

             -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
AREA           WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
CHARLOTTE        420      208        4        1      423      219        1        3      633      646     1279
BIRMINGHAM      1311      643     NONE        4     1238      625     NONE        6     1958     1869     3827
NYC              391      129       18        9      342      138       14        5      547      499     1046
UTAH             712        5       58       16      679        3       71       16      791      769     1560
CALI I           811       18       74       37      731       20       76       40      940      867     1807
CALI II          610        1       45       18      509        3       42       15      674      569     1243
TOTAL           4255     1004      199       85     3922     1008      204       85     5543     5219    10762

TABLE 1.3
AREA BY RACE BY SEX FOR 1975 DATA
(cell entries: N)

             -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
AREA           WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
CHARLOTTE        567      286        4        1      556      307        1        6      858      870     1728
BIRMINGHAM      1328      674        1        1     1273      667        1        9     2004     1950     3954
NYC              373      113       15        9      321      122       12       10      510      465      975
UTAH             717        1       59       19      673        3       70       14      796      760     1556
CALI I           902       26       81       52      827       18       76       49     1061      970     2031
CALI II          647        2       46       23      540        3       41       14      718      598     1316
TOTAL           4534     1102      206      105     4190     1120      201      102     5947     5613    11560

TABLE 1.4
AREA BY RACE BY SEX FOR COMPLETE DATA
(cell entries: N)

             -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
AREA           WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
CHARLOTTE        136       60        2        1      181       67     NONE     NONE      199      248      447
BIRMINGHAM       303      213     NONE     NONE      377      212     NONE     NONE      516      589     1105
NYC               63       14     NONE        1       62       12        2        2       78       78      156
UTAH             320     NONE       26        3      350     NONE       33        3      349      386      735
CALI I           451       17       40       14      393        8       38       18      522      457      979
CALI II          260     NONE       19        9      260     NONE       21       11      288      292      580
TOTAL           1533      304       87       28     1623      299       94       34     1952     2050     4002

TABLE 1.5
AREA BY POLLUTION INDEX BY SEX FOR 1973 DATA
(cell entries: N)

             ---- LOWER -----  ---- MIDDLE ----  ---- HIGHER ----    LOWER   MIDDLE   HIGHER
AREA            MALE   FEMALE     MALE   FEMALE     MALE   FEMALE    TOTAL    TOTAL    TOTAL    TOTAL
CHARLOTTE        396      405      252      267     NONE     NONE      801      519     NONE     1320
BIRMINGHAM       976      962      577      491     NONE     NONE     1938     1068     NONE     3006
NYC              108      116     NONE     NONE      189      188      224     NONE      377      601
UTAH             390      389      411      422     NONE     NONE      779      833     NONE     1612
CALI I           307      286      218      197      465      434      593      415      899     1907
CALI II          230      203     NONE     NONE      500      427      433     NONE      927     1360
TOTAL           2407     2361     1458     1377     1154     1049     4768     2835     2203     9806

TABLE 1.6
AREA BY POLLUTION INDEX BY SEX FOR 1974 DATA
(cell entries: N)

             ---- LOWER -----  ---- MIDDLE ----  ---- HIGHER ----    LOWER   MIDDLE   HIGHER
AREA            MALE   FEMALE     MALE   FEMALE     MALE   FEMALE    TOTAL    TOTAL    TOTAL    TOTAL
CHARLOTTE        316      326      317      320     NONE     NONE      642      637     NONE     1279
BIRMINGHAM      1259     1199      699      670     NONE     NONE     2458     1369     NONE     3827
NYC              212      164     NONE     NONE      335      335      376     NONE      670     1046
UTAH             381      368      410      401     NONE     NONE      749      811     NONE     1560
CALI I           280      265      209      185      451      417      545      394      868     1807
CALI II          228      181     NONE     NONE      446      388      409     NONE      834     1243
TOTAL           2676     2503     1635     1576     1232     1140     5179     3211     2372    10762

TABLE 1.7
AREA BY POLLUTION INDEX BY SEX FOR 1975 DATA
(cell entries: N)

             ---- LOWER -----  ---- MIDDLE ----  ---- HIGHER ----    LOWER   MIDDLE   HIGHER
AREA            MALE   FEMALE     MALE   FEMALE     MALE   FEMALE    TOTAL    TOTAL    TOTAL    TOTAL
CHARLOTTE        303      328      555      542     NONE     NONE      631     1097     NONE     1728
BIRMINGHAM      1268     1224      736      726     NONE     NONE     2492     1462     NONE     3954
NYC              210      169     NONE     NONE      300      296      379     NONE      596      975
UTAH             385      363      411      397     NONE     NONE      748      808     NONE     1556
CALI I           324      301      225      194      512      475      625      419      987     2031
CALI II          248      198     NONE     NONE      470      400      446     NONE      870     1316
TOTAL           2738     2583     1927     1859     1282     1171     5321     3786     2453    11560

TABLE 1.8
AREA BY POLLUTION INDEX BY SEX FOR COMPLETE DATA
(cell entries: N)

             ---- LOWER -----  ---- MIDDLE ----  ---- HIGHER ----    LOWER   MIDDLE   HIGHER
AREA            MALE   FEMALE     MALE   FEMALE     MALE   FEMALE    TOTAL    TOTAL    TOTAL    TOTAL
CHARLOTTE        114      126       85      122     NONE     NONE      240      207     NONE      447
BIRMINGHAM       336      432      180      157     NONE     NONE      768      337     NONE     1105
NYC               21       21     NONE     NONE       57       57       42     NONE      114      156
UTAH             188      208      161      178     NONE     NONE      396      339     NONE      735
CALI I           152      132       93       85      277      240      284      178      517      979
CALI II          107      100     NONE     NONE      181      192      207     NONE      373      580
TOTAL            918     1019      519      542      515      489     1937     1061     1004     4002

TABLE 1.9
MEAN COLDS IN 1973 BY AREA BY SEX BY RACE
(cell entries: mean colds)

             -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
AREA           WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
CHARLOTTE       0.86     0.67     0.50     0.50     1.02     1.16     ----     1.00     0.80     1.06     0.94
BIRMINGHAM      0.67     0.51     0.50     0.33     0.95     0.89     1.00     1.75     0.62     0.93     0.77
NYC             0.63     0.47     0.00     1.00     0.75     0.58     0.50     0.50     0.60     0.70     0.65
UTAH            0.69     0.75     0.42     0.44     1.00     0.50     0.71     1.11     0.66     0.98     0.82
CALI I          0.58     0.40     0.43     1.00     0.79     0.40     0.68     0.64     0.58     0.77     0.67
CALI II         0.72     0.00     0.55     0.31     0.93     0.33     0.72     0.17     0.70     0.90     0.79
TOTAL           0.68     0.55     0.45     0.70     0.92     0.93     0.69     0.74     0.65     0.91     0.78

TABLE 1.10
MEAN COLDS IN 1974 BY AREA BY SEX BY RACE
(cell entries: mean colds)

             -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
AREA           WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
CHARLOTTE       0.78     0.79     0.25     1.00     1.10     1.05     1.00     1.00     0.78     1.08     0.93
BIRMINGHAM      0.77     0.61     ----     0.75     0.97     1.03     ----     0.67     0.72     0.99     0.85
NYC             0.57     0.52     0.50     0.22     0.74     0.59     0.93     0.20     0.55     0.70     0.62
UTAH            0.80     0.40     0.48     0.50     0.99     1.33     0.62     0.63     0.77     0.95     0.86
CALI I          0.68     0.67     0.43     0.89     0.83     0.40     0.71     0.45     0.67     0.79     0.73
CALI II         0.72     1.00     0.33     0.50     0.99     0.67     0.74     0.40     0.69     0.95     0.81
TOTAL           0.74     0.63     0.43     0.66     0.94     0.96     0.70     0.49     0.70     0.93     0.81

TABLE 1.11
MEAN COLDS IN 1975 BY AREA BY SEX BY RACE
[Table body not present in the source text.]

TABLE 1.12
MEAN COLDS BY AREA BY SEX BY RACE FOR COMPLETE DATA
(cell entries: mean colds)

             -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
AREA           WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
CHARLOTTE       0.89     0.80     0.50     0.00     1.10     1.15     ----     ----     0.85     1.11     1.00
BIRMINGHAM      0.63     0.37     ----     ----     0.92     0.86     ----     ----     0.52     0.90     0.72
NYC             0.52     0.36     ----     1.00     0.71     0.58     1.50     0.50     0.50     0.71     0.60
UTAH            0.67     ----     0.38     0.33     0.95     ----     0.73     0.67     0.64     0.93     0.79
CALI I          0.59     0.41     0.57     1.14     0.82     0.25     0.68     0.56     0.60     0.79     0.69
CALI II         0.74     ----     0.63     0.44     0.97     ----     0.86     0.18     0.72     0.93     0.83
TOTAL           0.66     0.45     0.53     0.79     0.92     0.90     0.76     0.44     0.63     0.90     0.77

TABLE 1.13
MEAN COLDS BY POLLUTION INDEX BY SEX BY RACE FOR 1973 DATA
(cell entries: mean colds)

POLLUTION    -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
INDEX          WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
1               0.67     0.55     0.48     0.73     0.91     0.86     0.75     0.85     0.65     0.89     0.77
2               0.73     0.55     0.26     0.65     1.04     0.97     0.63     1.00     0.66     1.00     0.82
3               0.66     0.45     0.57     0.71     0.84     0.96     0.67     0.32     0.65     0.82     0.73
TOTAL           0.68     0.55     0.45     0.70     0.92     0.93     0.69     0.74     0.65     0.91     0.78

TABLE 1.14
MEAN COLDS BY POLLUTION INDEX BY SEX BY RACE FOR 1974 DATA
(cell entries: mean colds)

POLLUTION    -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
INDEX          WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
1               0.77     0.61     0.46     0.38     0.96     0.82     0.63     0.61     0.73     0.93     0.83
2               0.71     0.64     0.44     0.70     0.94     1.06     0.66     0.62     0.67     0.97     0.82
3               0.70     0.74     0.38     0.84     0.91     0.78     0.81     0.25     0.68     0.88     0.78
TOTAL           0.74     0.63     0.43     0.66     0.94     0.96     0.70     0.49     0.70     0.93     0.81

TABLE 1.15
MEAN COLDS BY POLLUTION INDEX BY SEX BY RACE FOR 1975 DATA
(cell entries: mean colds)

POLLUTION    -------------- MALE --------------  ------------- FEMALE -------------     MALE   FEMALE
INDEX          WHITE    BLACK HISPANIC    OTHER    WHITE    BLACK HISPANIC    OTHER    TOTAL    TOTAL    TOTAL
1               0.75     0.58     0.45     0.93     0.96     0.78     0.63     0.69     0.72     0.92     0.82
2               0.71     0.48     0.52     0.71     1.00     0.83     0.78     0.74     0.62     0.92     0.77
3               0.67     0.41     0.52     0.75     0.90     0.88     0.64     0.83     0.66     0.88     0.76
TOTAL           0.72     0.51     0.49     0.79     0.95     0.82     0.67     0.75     0.67     0.91     0.79
TABLE 1.16
PROPORTION OF COLDS BY POLLUTION BY SEX FOR COMPLETE DATA

(cell entries: PROP)

POLLUTION |        | FALL | WINT | SPR  | FALL | WINT | SPR  | FALL | WINT | SPR
INDEX     | SEX    |  72  |  73  |  73  |  73  |  74  |  74  |  74  |  75  |  75
----------+--------+------+------+------+------+------+------+------+------+------
1         | MALE   |  0.20|  0.23|  0.19|  0.23|  0.31|  0.23|  0.25|  0.30|  0.20
          | FEMALE |  0.28|  0.35|  0.27|  0.27|  0.41|  0.30|  0.31|  0.36|  0.28
----------+--------+------+------+------+------+------+------+------+------+------
2         | MALE   |  0.19|  0.23|  0.16|  0.20|  0.26|  0.21|  0.20|  0.22|  0.19
          | FEMALE |  0.31|  0.37|  0.29|  0.33|  0.36|  0.31|  0.32|  0.37|  0.28
----------+--------+------+------+------+------+------+------+------+------+------
3         | MALE   |  0.25|  0.22|  0.19|  0.18|  0.31|  0.21|  0.22|  0.25|  0.22
          | FEMALE |  0.28|  0.31|  0.25|  0.25|  0.39|  0.29|  0.29|  0.34|  0.28
----------+--------+------+------+------+------+------+------+------+------+------
TOTAL              |  0.25|  0.29|  0.23|  0.25|  0.35|  0.26|  0.27|  0.31|  0.24
CHAPTER II
RANDOMIZATION TESTS METHODOLOGY
2.1 Introduction
The analysis of the CHESS dataset will illustrate a
two stage approach in categorical data methodology. The
first stage involves the application of randomization test
strategies to assess the extent of association in the data
between the response variables (e.g., number of colds) and
evaluation variables (e.g., pollution index, area),
controlling for the effect of potential confounding
variables (e.g., sex, age). Most of these analyses will
involve the use of Mantel-Haenszel type statistics as
reviewed in Landis et al. (1978). Chapter II will be
concerned with a discussion of such randomization
techniques. Sections 2.3.2-2.3.3 discuss the randomization
statistic and its use in evaluating average partial
association. The mean score test and correlation test
extensions which can be utilized if the response or both
response and evaluation variables are ordinally scaled are
outlined in sections 2.3.4-2.3.5. The final section in this
chapter involves a further extension of randomization
statistics when there are multiple response variables. The
use of randomization statistics in variable selection
processes is described and illustrated in a later chapter.
The other phase of analysis is that of modeling the
variation among a set of estimates produced from the data
through the use of weighted least squares techniques. The
methodology of Grizzle, Starmer, and Koch (1969) for the linear
model analysis of categorical data is reviewed in Chapter
III. Examples of appropriate functions to model such as
mean scores are included, as well as a section on useful
analysis strategies for modeling the variation among
estimates derived from repeated measurements designs.
Finally, methods for the analysis of incomplete data are
given, including ratio estimation and supplemental margins.
First, the sections on data analysis methodology are
preceded with a discussion of the implications of the
research design under which data are obtained.
2.2 Research Design Implications
The nature of the statistical methodology applied to a
particular dataset is inherently linked with the research
design (or lack thereof) which gave rise to the data. The
analytical methods used as well as the interpretation of
their results and their generalizability to some extended
population are a function of the research sampling process
employed. Research data usually fall into one of the three
following design schemes:
1. Observational (historical) data from subjects in a
   study population having a natural (i.e., geographical,
   temporal) definition.
2. Data from an experimental design situation in which
   subjects are randomly allocated to different
   treatments.
3. Sample survey data where subjects are randomly
   selected from a larger study population.
Observational studies can include total population studies,
retrospective studies, and prospective in retrospect
studies. The CHESS dataset would be classified as a
prospective observational dataset.
The following discussion is in the spirit of Koch,
Gillings and Stokes (1980), and Koch and Gillings (1983).
The type of research design determines if and to what extent
two types of statistical strategies are applicable to the
research data. One is directed at the interpretation of
what is observed for the local population which the subjects
directly represent. The other is an extended population
analysis which generalizes to a larger target population
when certain assumptions about the sampling process can be
made. The local population coverage for a particular
research design depends on the degree to which randomization
was involved. A clinical trial experimental situation
involving fifty randomly allocated subjects may enable one
to make inferences to a local population which consists of
all the patients who had a known probability of receiving
either of two treatments. A sample survey data analysis
usually applies to a large-scale local population due to the
nature of the sampling design employed. Observational data,
with its inherent lack of randomization, may only qualify
for a local population analysis with results relevant to the
observed population only; the subjects studied are only
representative of themselves.
In general, one would like to make inferences to a
more extensive population than the local population.
However, various sorts of assumptions may be required to
establish the basis of generalization. One requirement is
that sample coverage is complete, i.e., all relevant
subpopulations are included in the sample. Another is that
any outcome differences within subpopulations be equivalent
to random variation. This has been termed the homogeneous
stratification assumption, and in stricter terms assumes
that the data are equivalent to a stratified random sample
for the target population in which the strata include
subjects partitioned according to an appropriate set of
explanatory variables. If the stratified population
structure is also assumable, then the framework is
equivalent to probability distributional sampling and the
appropriate statistical methods can be applied.
2.3 Randomization Test Methods
2.3.1 First Order Association
The first step in the analysis in this dissertation
involves an assessment of the association between the
outcome variables and the evaluation variable, and the
association between the outcome variables and the
demographic variables. The observational nature of the data
makes it reasonable to address this question in terms of the
following randomization hypothesis of no association:
Ho: There is no relationship between the subpopulations
    and the response variable in the sense that the
    observed partition of the response values into the
    subpopulations can be regarded as equivalent to
    a successive set of simple random samples.
The categorical data can be described in terms of the
general contingency table illustrated in Table 2.3.1.
Table 2.3.1
Observed Contingency Table

                 Response Variable Categories
Subpopulation      1       2      ...     r       Total
     1            y11     y12     ...    y1r      y1*
     2            y21     y22     ...    y2r      y2*
    ...           ...     ...     ...    ...      ...
     s            ys1     ys2     ...    ysr      ys*
   Total          y*1     y*2     ...    y*r      y**
Here, $y_{ij}$ denotes the number of subjects in the i-th
subpopulation who have the j-th response, with i=1,2,...,s
and j=1,2,...,r; $y_{i*}$ denotes the marginal total number of
subjects classified as being in the i-th subpopulation, $y_{*j}$
denotes the marginal total number of subjects classified as
belonging to the j-th response category, and $y_{**}$ denotes the
total sample size. Under the condition that both sets of
marginal totals are fixed, the randomization hypothesis can
be re-stated:
Ho: The response variable is distributed at random with
    respect to the subpopulations, i.e., the data in the
    respective rows of the table can be regarded as a
    successive set of simple random samples of sizes
    $\{y_{i*}\}$ from a fixed population corresponding to
    the marginal total distribution of the response
    variable $\{y_{*j}\}$.
Stratified sampling arguments can be used to show that, under
Ho, the vector y of cell frequencies
follows the product hypergeometric distribution given by:
(2.3.1)  $\Pr\{y \mid H_0\} = \dfrac{\prod_{i=1}^{s} y_{i*}! \; \prod_{j=1}^{r} y_{*j}!}{y_{**}! \; \prod_{i=1}^{s} \prod_{j=1}^{r} y_{ij}!}$
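As a hedged illustration of (2.3.1), the following Python sketch evaluates the product hypergeometric probability of an observed table with both margins fixed. The function name `table_probability` and the small example tables are hypothetical and are not taken from the CHESS data.

```python
from math import factorial, prod


def table_probability(table):
    """Probability of an s x r table under Ho with both margins fixed (2.3.1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Numerator: product of factorials of both sets of marginal totals.
    numerator = prod(factorial(t) for t in row_totals) * prod(
        factorial(t) for t in col_totals
    )
    # Denominator: factorial of the total times factorials of all cell counts.
    denominator = factorial(n) * prod(factorial(y) for row in table for y in row)
    return numerator / denominator


# All 2 x 2 tables with row totals (3, 2) and column totals (2, 3).
tables = [[[0, 3], [2, 0]], [[1, 2], [1, 1]], [[2, 1], [0, 2]]]
probabilities = [table_probability(t) for t in tables]
print(probabilities, sum(probabilities))
```

For a 2 x 2 table this is the ordinary hypergeometric probability used in Fisher's Exact Test, so the probabilities over all tables sharing the same margins sum to one.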
For observational data, these margins are fixed due to the
inherent nature of the hypothesis itself, and the
hypergeometric distribution assumption follows. With
experimental data, the row margins would be fixed by the
treatment allocation process, and the column margins fixed
by the hypothesis. Thus, the Y do not follow the
hypergeometric distribution when the null hypothesis is
rejected, and the implied hypergeometric model can not be
used as the basis for additional analysis. When the
hypergeometric model holds,
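(2.3.2)  $E\{y_{ij} \mid H_0\} = m_{ij} = \dfrac{y_{i*}\, y_{*j}}{y_{**}}$

(2.3.3)  $\mathrm{Cov}\{y_{ij}, y_{i'j'} \mid H_0\} = v_{ij,i'j'} = \dfrac{y_{i*}\, y_{*j}\, (\delta_{ii'}\, y_{**} - y_{i'*})(\delta_{jj'}\, y_{**} - y_{*j'})}{y_{**}^{2}\, (y_{**} - 1)}$

where $\delta_{kk'} = 1$ if $k = k'$ and $\delta_{kk'} = 0$ if $k \ne k'$.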
When the sample size is large enough (e.g., all $m_{ij} \ge 5$), the
vector y has an approximate multivariate normal distribution
by randomization central limit theory (see Puri and Sen (1971)), with
covariance matrix V, where the $\{v_{ij,i'j'}\}$ are the elements
of V. The quadratic form

$Q = (y - m)'\, A'\, [A V A']^{-1}\, A\, (y - m)$

thus represents a randomization test statistic which is
approximately chi-square with degrees of freedom = Rank {A}
when A is specified so that AVA' is non-singular. The
particular matrix

$A = [I_{(r-1)}, 0_{(r-1)}] \otimes [I_{(s-1)}, 0_{(s-1)}]$

generates (s-1)(r-1) linear functions from (y-m) which are
the differences between the observed and expected counts for
the first (r-1) cells of the first (s-1) subpopulations. In
this case, Q would have (s-1)(r-1) degrees of freedom. The
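statistic Q can be shown to have the form

(2.3.4)  $Q = \dfrac{y_{**} - 1}{y_{**}} \sum_{i=1}^{s} \sum_{j=1}^{r} \dfrac{(y_{ij} - m_{ij})^{2}}{m_{ij}} = \dfrac{y_{**} - 1}{y_{**}}\, Q_P$

where $Q_P$ is the familiar Pearson chi-square statistic.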
Thus, for large samples, Q and Qp are asymptotically
equivalent. When the sample size is very small, the
significance of Q can be evaluated with exact methods which
yield Fisher's Exact Test for 2 x 2 tables.
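The relationship in (2.3.4) can be checked numerically. The sketch below is an illustration only: it builds the expected values and covariances of (2.3.2) and (2.3.3), forms the quadratic form Q on the first (s-1)(r-1) cells, and compares it with the Pearson statistic. The helper names (`expected_and_cov`, `solve`, `randomization_q`, `pearson_q`) and the example table are hypothetical.

```python
from itertools import product


def expected_and_cov(table):
    """Return expected counts m_ij and a covariance function under Ho."""
    s, r = len(table), len(table[0])
    row = [sum(table[i]) for i in range(s)]
    col = [sum(table[i][j] for i in range(s)) for j in range(r)]
    n = sum(row)
    m = [[row[i] * col[j] / n for j in range(r)] for i in range(s)]

    def cov(i, j, i2, j2):
        # Covariance of y_ij and y_i2j2 as in (2.3.3).
        d_row = (n if i == i2 else 0) - row[i2]
        d_col = (n if j == j2 else 0) - col[j2]
        return row[i] * col[j] * d_row * d_col / (n ** 2 * (n - 1))

    return m, cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda t: abs(aug[t][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for t in range(c + 1, k):
            f = aug[t][c] / aug[c][c]
            for u in range(c, k + 1):
                aug[t][u] -= f * aug[c][u]
    x = [0.0] * k
    for c in reversed(range(k)):
        tail = sum(aug[c][u] * x[u] for u in range(c + 1, k))
        x[c] = (aug[c][k] - tail) / aug[c][c]
    return x


def randomization_q(table):
    """Quadratic form Q on the first (s-1)(r-1) cells of the table."""
    s, r = len(table), len(table[0])
    m, cov = expected_and_cov(table)
    cells = list(product(range(s - 1), range(r - 1)))
    d = [table[i][j] - m[i][j] for i, j in cells]
    v = [[cov(i, j, i2, j2) for i2, j2 in cells] for i, j in cells]
    return sum(di * xi for di, xi in zip(d, solve(v, d)))


def pearson_q(table):
    """Pearson chi-square statistic Q_P."""
    m, _ = expected_and_cov(table)
    return sum(
        (y - e) ** 2 / e for ry, re in zip(table, m) for y, e in zip(ry, re)
    )


example = [[10, 20, 30], [25, 15, 20], [5, 30, 10]]  # hypothetical counts
n_total = sum(map(sum, example))
print(randomization_q(example), (n_total - 1) / n_total * pearson_q(example))
```

The two printed values should agree up to floating-point error, which is the content of (2.3.4).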
2.3.2 Partial Association
The meaningfulness of a significant first order
association is limited, especially in an observational data
setting. Apparent associations may be due to the effects of
other variables which have not been taken into account in
the analysis. In addition, controlling for the effects of
such variables often leads to the uncovering of an effect
which was not evident in a first order association sense.
In epidemiological literature, this phenomenon is known as
confounding, defined by Breslow and Day (1980) as
"... the distortion of a disease/exposure association
brought about by the association of other factors
with both disease and exposure, the latter
association with the disease being causal."
Disease status can be considered the response variable,
while exposure variables can be considered evaluation
variables. Thus, confounding variables must be taken into
account when one is attempting to assess the relationship
between a response variable and evaluation variable. One of
the most frequently used methods to control for these
confounders is to partition the data into a set of strata
which are internally homogeneous with respect to the
confounding variables and then calculate a measure that
summarizes the associations within the strata as a whole.
Randomization model methods are an appropriate strategy for
accomplishing this within the context of hypotheses
corresponding to randomized allocations for the distribution
of the response variable across subpopulations for the
evaluation variable.
With categorical data, the confounder variable being
controlled will consist of a number of categories which
correspond to covariate levels such as hospital,
geographical area, or age group. These will be the basis of
stratification. Thus, the data can be expressed in terms of
q:s x r contingency tables where q is the number of
covariate levels. The hypothesis of interest can then be
stated in terms of no partial association between the
response variable and the evaluation variable after
adjusting for the possible effects due to the confounders.
Cochran proposed a test statistic for the hypothesis of no
partial association in q: 2 x 2 tables in 1954 (Cochran
1954). His statistic was based on a mean difference
weighted across the q tables. The evaluation of this chi-
square statistic's significance required asymptotic
assumptions for a binomial model which requires moderately
large sample sizes for each of the strata (≥ 20). In 1971,
Hopkins and Gross proposed a generalization of this
procedure to q: s x r tables (Hopkins and Gross 1971).
Mantel and Haenszel (1959) noted that the same problem
could be addressed within the context of a hypergeometric
framework which required only that the total sample size
(across tables) be large enough for asymptotic methods to
apply. Their method utilized expected values and variances
for a pivot cell in each of the 2 x 2 tables, but is
invariant to which of the cells is selected. When sample
sizes are appropriately large, the Cochran and Mantel-
Haenszel methods lead to equivalent results, as the
statistics differ by the factor $(y_{h**}-1)/y_{h**}$, where $y_{h**}$ is
the sample size of the h-th stratum. In the same paper,
Mantel and Haenszel also outlined a procedure which would
examine the hypothesis of no partial association in a set of
q:2 x r tables. The method consists of calculating the
expected values and covariances of (r-1) pivot cells for
each of the tables using the hypergeometric model as before.
These quantities are summed and the hypothesis tested via a
quadratic form test statistic. However, the details were
only outlined in the (2 x 3) contingency table setting.
In 1963, Mantel discussed the situation involving
q: s x r tables where the response variable is ordinally
scaled with progressively larger intensities. He discussed
scoring systems and developed a test statistic which was a
function of the subpopulation mean scores. He suggested a
correlation type statistic if the evaluation variable was
also ordinally scaled. The test statistic based on (s-1)(r-1)
pivotal cells has been outlined and used in data
analysis, as illustrated by Koch and Reinfurt (1974) with
highway safety data. Landis, Heyman, and Koch (1978) have
discussed the methods with which one can evaluate average
partial association in contingency tables and also tie
together the Cochran-Mantel-Haenszel approach to the
analysis of q: s x r tables with a general notation and
matrix formulation. Koch, Imrey et al. (1985) also present
randomization statistics which apply in the analysis of
categorical data in a comprehensive manner. The next
sections detail the existing randomization methodology.
2.3.3 Average Partial Association Methodology
Let h=1,2, ... q index a set of q: s x r contingency
tables where h corresponds to a distinct level of a
stratification variable or combination of such variables.
The general table outlined in Table 2.3.1 can represent the
h-th such table, where s subpopulations are to be compared
with respect to a response variable which has r outcome
categories. All of the table entries should be considered
to have an 'h' appended to the beginning of the subscript.
Thus, let $y_{hij}$ denote the number of subjects who are
classified as belonging to the h-th stratum, i-th
subpopulation and j-th response category. Let $y_{hi*}$ denote
the marginal total number of subjects who are classified as
belonging to the i-th subpopulation for the h-th stratum and
$y_{h*j}$ denote the marginal total number of subjects who are
classified into the j-th response level.
The hypothesis being investigated is the randomization
hypothesis of no association and is stated as follows:
Ho: For each stratum h=1,2,...,q, the response
    variable is distributed at random across the
    subpopulations; i.e., the data $\{y_{hij}\}$ for the
    respective subpopulations arose as successive
    sets of stratified simple random samples of sizes
    $\{y_{hi*}\}$ from the finite population distributions
    $\{y_{h*j}\}$ for the q strata.
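By the above hypothesis, the $\{y_{hij}\}$ have the product
multiple hypergeometric distribution

(2.3.5)  $\Pr\{y \mid H_0\} = \prod_{h=1}^{q} \dfrac{\prod_{i=1}^{s} y_{hi*}! \; \prod_{j=1}^{r} y_{h*j}!}{y_{h**}! \; \prod_{i=1}^{s} \prod_{j=1}^{r} y_{hij}!}$

where $y = (y_1', y_2', \ldots, y_q')'$. As reviewed in section 2.3.1,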
the Yh have independent approximate multivariate normal
distributions. The mean vector and corresponding covariance
matrix are expressible as
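(2.3.6)  $E\{y_h \mid H_0\} = m_h = y_{h**}\, [p_{h.*} \otimes p_{h*.}]$

         $\mathrm{Var}\{y_h \mid H_0\} = V_h = \dfrac{y_{h**}^{2}}{y_{h**} - 1} \left\{ [D_{p_{h.*}} - p_{h.*}\, p_{h.*}'] \otimes [D_{p_{h*.}} - p_{h*.}\, p_{h*.}'] \right\}$

where $p_{h*.} = (y_{h1*}, y_{h2*}, \ldots, y_{hs*})'/y_{h**}$ and
$p_{h.*} = (y_{h*1}, y_{h*2}, \ldots, y_{h*r})'/y_{h**}$.
$D_a$ represents a diagonal matrix with the elements of the vector
a on the main diagonal, and $\otimes$ denotes Kronecker product
multiplication.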
Consider the situation where both the response
variable and evaluation variable are measured on a nominal
scale. Ho can be tested in terms of the sum of the
individual approximate chi-square statistics Q for each of
the q strata, i.e.
(2.3.7)  $Q_{TR} = \sum_{h=1}^{q} (y_h - m_h)'\, A'\, (A V_h A')^{-1}\, A\, (y_h - m_h)$

where $A = [I_{(r-1)}, 0_{(r-1)}] \otimes [I_{(s-1)}, 0_{(s-1)}]$ as before. $Q_{TR}$ is
approximately chi-square with d.f.=q(r-1)(s-1) and is termed
the total partial association statistic. There must be
sufficient individual stratum sample sizes in order that $Q_{TR}$
be appropriate.
Ho can also be investigated in terms of the sums of
corresponding differences between the observed and expected
frequencies across the q strata in an extension of the
Mantel-Haenszel methodology. Let
$G = \sum_{h=1}^{q} A (y_h - m_h)$; hence $\mathrm{Var}\{G \mid H_0\} = \sum_{h=1}^{q} A V_h A'$.

Thus, an appropriate test statistic is written

(2.3.8)  $Q_{AR} = \left[ \sum_{h=1}^{q} (y_h - m_h)' A' \right] \left[ \sum_{h=1}^{q} A V_h A' \right]^{-1} \left[ \sum_{h=1}^{q} A (y_h - m_h) \right]$

and is called the average partial association statistic. It
is approximately chi-square with d.f.=Rank{A} and is
effective in assessing the extent to which Ho is
contradicted by a consistent pattern of association of
response and evaluation variables across the strata.
An advantage of QAR over QTR is its less stringent
sample size requirements. Its asymptotic distribution
depends on the overall sample size $y_{***} = \sum_h y_{h**}$, not on each of
the individual $y_{h**}$. Also, since $Q_{AR}$'s significance is
evaluated with respect to fewer degrees of freedom, there is
a potential gain in power. It should be noted that $Q_{AR}$ has a
narrow alternative to Ho in comparison with $Q_{TR}$. While the
null hypothesis of no difference in evaluation groups
applies to both statistics in this situation, the
alternative hypothesis for $Q_{AR}$ can be stated as there is a
consistent difference across the strata. Thus, $Q_{AR}$ may fail to
detect a difference when differences in opposite directions
offset one another across the strata. This
has been called the broad hypothesis, narrow alternative
property.
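When s=r=2 in the case of a set of q: 2 x 2 tables,
the average partial association (Mantel-Haenszel) statistic
becomes

(2.3.9)  $Q_{AR} = \dfrac{(y_{*11} - m_{*11})^{2}}{\sum_{h=1}^{q} v_{h11}} = \dfrac{\left[ \sum_{h=1}^{q} \dfrac{y_{h1*}\, y_{h2*}}{y_{h**}} (p_{h11} - p_{h21}) \right]^{2}}{\sum_{h=1}^{q} \dfrac{y_{h1*}\, y_{h2*}\, y_{h*1}\, y_{h*2}}{y_{h**}^{2}\, (y_{h**} - 1)}}$

where $y_{*11} = \sum_h y_{h11}$, $m_{*11} = \sum_h m_{h11}$, and
$p_{hi1} = y_{hi1}/y_{hi*}$ is the proportion of the i-th subpopulation
of the h-th stratum with the first response.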
QAR is the same as the 1959 Mantel-Haenszel statistic, and
is approximately chi-square with 1 d.f.
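A minimal sketch of (2.3.9) in Python follows, together with the corresponding total partial association statistic for 2 x 2 strata, to illustrate the broad hypothesis, narrow alternative property. The function names and both sets of counts are hypothetical illustrations, not CHESS results.

```python
def stratum_terms(table):
    """Return (y11 - m11, v11) for one 2 x 2 table under Ho."""
    (y11, y12), (y21, y22) = table
    r1, r2 = y11 + y12, y21 + y22
    c1, c2 = y11 + y21, y12 + y22
    n = r1 + r2
    deviation = y11 - r1 * c1 / n
    variance = r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))
    return deviation, variance


def average_partial_q(tables):
    """Q_AR of (2.3.9): square of summed deviations over summed variances."""
    terms = [stratum_terms(t) for t in tables]
    return sum(d for d, _ in terms) ** 2 / sum(v for _, v in terms)


def total_partial_q(tables):
    """Q_TR for 2 x 2 strata: sum of the stratum-specific statistics."""
    return sum(d ** 2 / v for d, v in map(stratum_terms, tables))


# Rows are subpopulations, columns are response categories (hypothetical).
consistent = [((20, 10), (10, 20)), ((15, 5), (10, 10))]
offsetting = [((20, 10), (10, 20)), ((10, 20), (20, 10))]

print(average_partial_q(consistent), total_partial_q(consistent))
print(average_partial_q(offsetting), total_partial_q(offsetting))
```

In the second example the two strata show equal differences in opposite directions, so the summed deviations cancel and the average statistic is zero even though each stratum individually departs from Ho.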
2.3.4 Mean Score Test
When the response variable is ordinally scaled with
progressively larger intensities, Ho can be tested in terms
of the variation of stratum mean scores among the
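subpopulations. Let $a_h' = (a_{h1}, a_{h2}, \ldots, a_{hr})$ represent a set of
scores for the h-th table which characterize the relative
status of the response variable levels in some way. Let

(2.3.10)  $\bar{y}_{hi*} = \sum_{j=1}^{r} a_{hj}\, y_{hij} / y_{hi*}$

be the mean score for the i-th subpopulation in the h-th
stratum. The vector of mean scores $\bar{y}_{h**} = (\bar{y}_{h1*}, \bar{y}_{h2*}, \ldots, \bar{y}_{hs*})'$
can be expressed as

(2.3.11)  $\bar{y}_{h**} = \dfrac{1}{y_{h**}}\, D^{-1}_{p_{h*.}}\, [a_h' \otimes I_s]\, y_h$

By (2.3.6),

(2.3.12)  $E\{\bar{y}_{h**} \mid H_0\} = \mu_{a,h}\, 1_s$

          $\mathrm{Var}\{\bar{y}_{h**} \mid H_0\} = \dfrac{v_{a,h}}{y_{h**} - 1} \left[ D^{-1}_{p_{h*.}} - 1_s 1_s' \right]$

where $\mu_{a,h} = \sum_{j=1}^{r} a_{hj}\, (y_{h*j}/y_{h**})$ is the finite population mean score and

(2.3.14)  $v_{a,h} = \sum_{j=1}^{r} (a_{hj} - \mu_{a,h})^{2}\, (y_{h*j}/y_{h**})$

which is the finite population variance for all subjects in
the h-th stratum.
Let $C = [I_{(s-1)}, -1_{(s-1)}]$ denote a matrix which will
compare the first (s-1) subpopulations with the s-th. Then

$E\{C \bar{y}_{h**} \mid H_0\} = 0_{(s-1)}$

$\mathrm{Var}\{C \bar{y}_{h**} \mid H_0\} = \dfrac{v_{a,h}}{y_{h**} - 1}\, C D^{-1}_{p_{h*.}} C'$
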
When sample sizes for the subpopulations within the strata
are large enough (e.g., $y_{hi*} \ge 30$), the $\bar{y}_{h**}$ will have
independent approximate multivariate normal distributions
via randomization central limit theory. Thus, the
statistics

(2.3.16)  $Q_{RS,h} = (y_{h**} - 1)\, \bar{y}_{h**}'\, C' \left[ C D^{-1}_{p_{h*.}} C' \right]^{-1} C\, \bar{y}_{h**} \,/\, v_{a,h}$

have approximate chi-square distributions with d.f.=(s-1),
and

(2.3.17)  $Q_{TRS} = \sum_{h=1}^{q} Q_{RS,h}$

has an approximate chi-square distribution with d.f.=q(s-1).
It should also be noted that $Q_{RS,h}$ can be verified to have
the analysis of variance form

(2.3.18)  $Q_{RS,h} = (y_{h**} - 1)\, \dfrac{\sum_{i=1}^{s} y_{hi*}\, (\bar{y}_{hi*} - \mu_{a,h})^{2}}{y_{h**}\, v_{a,h}}$

The Mantel-Haenszel methodology can be extended to
encompass this mean score situation as well, by focusing on
the sum of the differences between the subpopulation mean
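scores and their expected values under Ho. Let

$G = \sum_{h=1}^{q} A_h (y_h - m_h)$

be such a sum, where $A_h = C[a_h' \otimes I_s]$ and $C = [I_{(s-1)}, -1_{(s-1)}]$
as defined above. Then

$E\{G \mid H_0\} = 0$

(2.3.19)  $\mathrm{Var}\{G \mid H_0\} = \sum_{h=1}^{q} A_h V_h A_h'$

Thus, the randomization mean score statistic to detect
average partial association can be written

(2.3.20)  $Q_{ARS} = \left\{ \sum_{h=1}^{q} (y_h - m_h)' A_h' \right\} \left\{ \sum_{h=1}^{q} A_h V_h A_h' \right\}^{-1} \left\{ \sum_{h=1}^{q} A_h (y_h - m_h) \right\}$

where $A_h$ is defined as above. $Q_{ARS}$ is approximately chi-square
with d.f.=(s-1), and is directed at location shift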
alternatives in the sense that it assesses the extent to
which mean scores in one subpopulation exceed (or are
exceeded by) the mean scores in the others. QARS is often
termed an analysis of variance test statistic, and as in the
case of $Q_{AR}$, its asymptotic properties are linked to the
total combined strata sample sizes rather than to the
individual strata sizes.
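For s=2 subpopulations the quadratic form in (2.3.20) reduces to a scalar, which permits a short sketch without matrix algebra. The Python below is an illustration under that assumption; the function name, the use of one common score vector in every stratum, and the example counts are all hypothetical.

```python
def mean_score_q_two_groups(strata, scores):
    """Q_ARS for s = 2 subpopulations (scalar case of (2.3.20)).

    strata: list of 2 x r tables, rows = subpopulations,
            columns = ordered response categories.
    scores: score vector applied in every stratum (an assumption here).
    """
    numerator = 0.0
    denominator = 0.0
    for row1, row2 in strata:
        n1, n2 = sum(row1), sum(row2)
        n = n1 + n2
        col = [a + b for a, b in zip(row1, row2)]
        # Finite population mean and variance of the scores in this stratum.
        mu = sum(a * c for a, c in zip(scores, col)) / n
        var = sum((a - mu) ** 2 * c for a, c in zip(scores, col)) / n
        # Observed score total for subpopulation 1 minus its expectation.
        observed = sum(a * y for a, y in zip(scores, row1))
        numerator += observed - n1 * mu
        denominator += n1 * n2 * var / (n - 1)
    return numerator ** 2 / denominator


# With binary scores on a dichotomous response this is the 2 x 2 statistic.
strata_2x2 = [((20, 10), (10, 20)), ((15, 5), (10, 10))]
print(mean_score_q_two_groups(strata_2x2, [1, 0]))

# Three ordered response levels with integer scores (hypothetical counts).
strata_2x3 = [((10, 12, 8), (5, 10, 15)), ((7, 9, 4), (6, 8, 11))]
print(mean_score_q_two_groups(strata_2x3, [1, 2, 3]))
```

The statistic is referred to a chi-square distribution with s-1 = 1 degree of freedom.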
Typically, applied scores include integer scores, rank
scores, ridit scores, binary scores and logrank scores.
Integer scores are written $a_{hj} = j$ for j=1,2,...,r and are of
interest when the response levels are considered equally
spaced. Binary scores are those such as $a_{hj} = 1$ for
j=1,2,...,k and $a_{hj} = 0$ for j=(k+1),...,r and are useful when
the focus is on comparing a certain combination of the
levels of the response variables with the others. They
collapse the response levels into a dichotomous variable.
Rank scores are defined as

(2.3.21)  $a_{hj} = \left\{ \sum_{k=1}^{j-1} y_{h*k} \right\} + \left\{ (y_{h*j} + 1)/2 \right\}$

and are useful for ordinal data; midranks are used for
ties. Ridit scores are rank scores which are divided by the
strata sample sizes $y_{h**}$. When rank scores are applied,
QARS is equivalent to a partial Kruskal-Wallis statistic.
Logrank scores are written as
(2.3.22)
and are of interest for L-shaped distributions and other
distributions when differences in subpopulations are more
apparent in the upper response categories than the lower.
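The integer, binary, rank, and ridit scoring systems described above can be sketched as follows. The function names and the example column totals are hypothetical; ridit scores are computed exactly as described in the text, i.e., rank scores divided by the stratum sample size. Logrank scores are omitted from the sketch.

```python
def integer_scores(col_totals):
    """a_hj = j for j = 1, ..., r."""
    return list(range(1, len(col_totals) + 1))


def binary_scores(r, k):
    """a_hj = 1 for j <= k and 0 for j > k."""
    return [1 if j <= k else 0 for j in range(1, r + 1)]


def rank_scores(col_totals):
    """Midrank scores of (2.3.21) from the response marginal totals."""
    scores = []
    below = 0  # number of subjects in lower response categories
    for count in col_totals:
        scores.append(below + (count + 1) / 2)
        below += count
    return scores


def ridit_scores(col_totals):
    """Rank scores divided by the stratum sample size."""
    n = sum(col_totals)
    return [a / n for a in rank_scores(col_totals)]


totals = [2, 3, 5]  # hypothetical response marginal totals for one stratum
print(integer_scores(totals))
print(binary_scores(4, 2))
print(rank_scores(totals))
print(ridit_scores(totals))
```

A quick consistency check on midranks is that the count-weighted sum of the rank scores equals n(n+1)/2 for a stratum of size n.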
2.3.5 Correlation Test
An additional set of possible test statistics for Ho
occurs when both the response and subpopulations are
ordinally scored with progressively larger intensities.
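Specifically, one can focus on the product score functions

(2.3.22)  $F_h = \sum_{i=1}^{s} \sum_{j=1}^{r} c_{hi}\, a_{hj}\, y_{hij} / y_{h**}$

where the $\{c_{hi}\}$ characterize the relative status of the
subpopulation variable levels much as the $\{a_{hj}\}$ do for the
response variable for each of the q strata. (2.3.1) and
(2.3.2) imply

(2.3.23)  $E\{F_h \mid H_0\} = \left[ \sum_{i=1}^{s} c_{hi}\, y_{hi*}/y_{h**} \right] \left[ \sum_{j=1}^{r} a_{hj}\, y_{h*j}/y_{h**} \right] = \mu_{c,h}\, \mu_{a,h}$

(2.3.24)  $\mathrm{Var}\{F_h \mid H_0\} = \dfrac{\left[ \sum_{i=1}^{s} (c_{hi} - \mu_{c,h})^{2}\, y_{hi*}/y_{h**} \right] \left[ \sum_{j=1}^{r} (a_{hj} - \mu_{a,h})^{2}\, y_{h*j}/y_{h**} \right]}{y_{h**} - 1} = \dfrac{v_{c,h}\, v_{a,h}}{y_{h**} - 1}$
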
Sufficient stratum sample sizes $\{y_{h**}\}$ ($\ge 30$) provide for
randomization central limit theory to imply that the $\{F_h\}$
have independent normal distributions with (2.3.23) and
(2.3.24) as their expected values and variances,
respectively. A correlation test statistic can hence be
calculated as

(2.3.25)  $Q_{RC,h} = \dfrac{(y_{h**} - 1)(F_h - \mu_{c,h}\, \mu_{a,h})^{2}}{v_{c,h}\, v_{a,h}}$

          $= \dfrac{(y_{h**} - 1) \left[ \sum_{i=1}^{s} \sum_{j=1}^{r} (c_{hi} - \mu_{c,h})(a_{hj} - \mu_{a,h})\, (y_{hij}/y_{h**}) \right]^{2}}{v_{c,h}\, v_{a,h}}$

          $= (y_{h**} - 1)\, R_{ac,h}^{2}$

where $R_{ac,h}$ is the correlation coefficient of response scores
and subpopulation scores. The $Q_{RC,h}$ have independent chi-square
distributions with d.f.=1. Thus, the total partial
association statistic

(2.3.26)  $Q_{TRC} = \sum_{h=1}^{q} Q_{RC,h}$

is chi-square with d.f.=q.
The general form

$Q = \left\{ \sum_{h=1}^{q} (y_h - m_h)' A_h' \right\} \left\{ \sum_{h=1}^{q} A_h V_h A_h' \right\}^{-1} \left\{ \sum_{h=1}^{q} A_h (y_h - m_h) \right\}$

is again appropriate to investigate Ho in terms of average
partial association, this time in the context of a
correlation-type statistic; $Q_{ARC} = Q$, where

$A_h = [a_h' \otimes c_h']$

for $a_h = (a_{h1}, a_{h2}, \ldots, a_{hr})'$ and $c_h = (c_{h1}, c_{h2}, \ldots, c_{hs})'$, is
asymptotically chi-square with d.f.=1. It is directed at
the extent to which there is a consistent positive or
negative association between the subpopulation scores and
response scores across the strata, in an 'average sense.'
If ridit or rank scores are applied, with midranks assigned
for ties, the QARC is equivalent to a partial Spearman Rank
correlation test.
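The single-stratum correlation statistic of (2.3.25) can be sketched directly from the marginal means and variances of the two score sets. The function name and the example tables below are hypothetical illustrations.

```python
def correlation_q(table, row_scores, col_scores):
    """Q_RC,h = (n - 1) * R^2 for one s x r table, following (2.3.25)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Finite population means and variances of the two score sets.
    mu_c = sum(c * t for c, t in zip(row_scores, row_tot)) / n
    mu_a = sum(a * t for a, t in zip(col_scores, col_tot)) / n
    v_c = sum((c - mu_c) ** 2 * t for c, t in zip(row_scores, row_tot)) / n
    v_a = sum((a - mu_a) ** 2 * t for a, t in zip(col_scores, col_tot)) / n
    # Cross-product of centered subpopulation and response scores.
    cross = sum(
        (c - mu_c) * (a - mu_a) * y
        for c, row in zip(row_scores, table)
        for a, y in zip(col_scores, row)
    ) / n
    return (n - 1) * cross ** 2 / (v_c * v_a)


# 3 x 3 table with integer scores for both variables (hypothetical counts).
ordinal = [[12, 8, 4], [9, 10, 9], [3, 7, 14]]
print(correlation_q(ordinal, [1, 2, 3], [1, 2, 3]))

# For a 2 x 2 table the statistic coincides with (n - 1)/n times Pearson's.
print(correlation_q([[20, 10], [10, 20]], [1, 2], [1, 2]))
```

Each such statistic has one degree of freedom regardless of s and r, which is the source of its power against monotone trend alternatives.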
2.4 Multivariate Randomization Statistics
There may be interest in assessing whether more than
one response variable is distributed at random across the q
strata. Suppose that there are d response variables, which
can be jointly cross-classified into
$r = \prod_{k=1}^{d} L_k$

response profiles, where the k-th response has $L_k$ levels.
Thus, the data can still be considered as a q: s x r array,
but the frequencies now represent the joint distribution of
d response variables for s subpopulations. The
randomization hypothesis of no association can be expressed
as follows:
Ho: For each stratum h=1,2,...,q, the response
    variables are jointly distributed at random
    across the subpopulations.
This hypothesis can be investigated with multivariate
extensions of the randomization statistics discussed in
section 2.3.
Let $A_h = [a_1, a_2, \ldots, a_d]'$ denote a matrix of score
vectors for the h-th stratum, where $a_k$ (k=1,2,...,d)
represents a vector of scalings of the r response profiles
which construct summary measures for the d response
variables. For example, if two dichotomous response
variables were cross-classified to form four response
profiles, then appropriate binary scores might be
$a_1' = (0,0,1,1)$ and $a_2' = (0,1,0,1)$. A function vector which
represents the compounded mean scores is written

(2.4.1)  $F_h = \dfrac{1}{y_{h**}} \left[ A_h \otimes D^{-1}_{p_{h*.}} \right] y_h$

and has expected value

(2.4.2)  $E\{F_h \mid H_0\} = [A_h\, p_{h.*}] \otimes 1_s$

and covariance matrix

(2.4.3)  $\mathrm{Var}\{F_h \mid H_0\} = \dfrac{1}{y_{h**} - 1} \left[ A_h (D_{p_{h.*}} - p_{h.*}\, p_{h.*}') A_h' \right] \otimes \left[ D^{-1}_{p_{h*.}} - 1_s 1_s' \right] = V_{F_h}$

When the within stratum sample sizes are sufficiently large
(i.e., $y_{hi*} \ge 30$), the $F_h$ have approximate multivariate normal
distributions via randomization central limit theory
arguments as previously discussed. One can then construct
quadratic forms in linear functions among these elements to
test Ho. Let C denote a (c x s) full rank matrix of among
subpopulation contrasts. Then a test statistic for the
comparison of the summary measures constructed from Ah in
the h-th stratum can be written
(2.4.4)  $Q_{R,h} = F_h' [I_d \otimes C]' \left\{ [I_d \otimes C]\, V_{F_h}\, [I_d \otimes C]' \right\}^{-1} [I_d \otimes C]\, F_h$

$Q_{R,h}$ is approximately chi-square with dc d.f. and can be
thought of as a multivariate mean score statistic. For the
case where $C = [I_{(s-1)}, -1_{(s-1)}]$, the test statistic $Q_{R,h}$ can be
shown to have a one-way analysis of variance form. If rank
scores are used in the score matrix, given that the
dependent variables are ordinally scaled, then $Q_{R,h}$ is the
multivariate Kruskal-Wallis test (Puri and Sen 1971).
A total partial association statistic can be computed
when all the $y_{hi*}$ are large. The sum of the $Q_{R,h}$ over the q strata
is approximately chi-square with d.f.=qdc when this is true.
However, it may be more appropriate to investigate Ho in
terms of an average partial association statistic. Such a
statistic would judge the extent to which consistent
patterns across the strata exist for the summary measures
constructed from the response variables. It has the form:

(2.4.5)  $Q_{AR} = [F - E(F)]'\, V_F^{-1}\, [F - E(F)]$

where

(2.4.6)  $F - E(F) = \sum_{h=1}^{q} (A_h \otimes C)(y_h - m_h)$

and $V_F$ is the corresponding covariance matrix. $Q_{AR}$ has an
approximate chi-square distribution with d.f.=dc.
2.5 Summary
There are a number of randomization test statistics
which can be applied to evaluate the hypothesis Ho as stated
in section 2.3.3. The Cochran-Mantel-Haenszel approach to
the analysis of q: s x r contingency tables has been
presented in generality which includes matrix representation
and the analysis of multivariate response vectors. Average
partial association methods are emphasized since they
require less stringent sample sizes and do not require
homogeneity of subpopulation x response association, at
least not explicitly. It should again be noted, however,
that the alternative hypothesis of a consistent pattern of
variation implied by the use of average partial association
methodology is a narrow one. One needs to continually be
aware of this during its application. These methods have
another limitation in that the resulting inferences may only
apply to the actual subjects under study rather than to some
broader target population. For these reasons, such methods
may need to be supplemented by other statistical methods if
data description is an analysis objective.
CHAPTER III
WEIGHTED LEAST SQUARES METHODOLOGY
3.1 Introduction
Randomization methods often need to be supplemented by
other procedures in an overall statistical analysis if there
is interest in describing the variation among a set of
estimates produced from the data in the context of a
statistical model. Hypotheses investigating the
significance of the various sources of variation can thus be
couched in terms of linear hypotheses concerning the model
parameters. With categorical data, weighted least squares
methods are often an appropriate means of fitting linear
models since the homogeneous variance assumptions of the
usual least squares strategies are not required. The
estimates being modeled might be those resulting from a
cross-classification of variables which were chosen by a
variable selection scheme (see Chapter IV). For example, in
the CHESS dataset, the estimates under study might be the
proportion of subjects who had asthma or the average number
of colds for a given year for an area x sex x pollution
index cross-classification. Section 3.2 reviews the
weighted least squares methodology which would be
appropriate for the analysis of such an example.
Since the dataset is longitudinal in nature,
consisting of a possible nine measurements over time for the
cold and asthma indication measures, repeated measurement
strategies are potentially advantageous in its statistical
analysis. The questions of whether time is an important
source of variation and whether it interacts with other
explanatory variables, as well as the nature of the pattern
of variation, can be addressed within a linear model
framework which takes into account the repeated measures
structure of the data. Section 3.3 gives an overview of the
application of linear models to repeated measurements data,
including some traditional approaches for continuous data.
In addition, those repeated measurement strategies which are
appropriate for categorical data are described.
A key feature of the CHESS dataset is the presence of
incomplete data. While the 4002 observations which are
complete can be analysed by themselves, a great deal of
additional information resides in the remaining 16,343
observations, each of which has three or six valid data
points and is missing the other six or three measurements.
This is typical of most longitudinal studies, especially as the
number of subjects and/or measurement times grows large.
Section 3.5 discusses the strategies which can be employed
to deal with incomplete data. Specifically, it discusses
ratio estimation and supplemental margins, missing data
techniques which can be applied to categorical data.
3.2 Weighted Least Squares Methodology
3.2.1 Overview
Weighted least squares functional modeling is a
general strategy which allows one to describe the variation
among subpopulation summary measures which are functions of
within-subpopulation response distributions. The possible
choices for summary measures include a wide variety of
functions such as proportions, means, ratios, and kappa
statistics. The required elements of a weighted least
squares functional analysis (WLSA) are the specification of
the summary measures as a (u x 1) vector $F$, and a
non-singular, consistent estimator $V_F$ of the covariance matrix
of $F$. The function estimates are asymptotically normal, and
their asymptotic covariance matrix is model-free. Weighted
least squares can be applied to estimate the parameters of a
linear model
(3.2.1)    $E_A\{F\} = \mu = X\beta$
where $X$ is a (u x t) design matrix of full rank $t \le u$ and $\beta$
is a (t x 1) vector of unknown parameters. $E_A\{\ \}$ represents
asymptotic expected value as the size of the sample on which
the summary statistic is based increases appropriately.
Another manner in which linear modeling can proceed is via
the construction and testing of hypotheses of the form
(3.2.2)    $H_0: W\mu = 0$
where $\mu$ is a (u x 1) vector of expected values of the
functions under study, and $W$ is a known ((u-t) x u) matrix
of constraints. Both model specifications are linked, as
will be discussed. Wald statistics are employed to assess
the appropriateness of both (3.2.1) and (3.2.2), and it can
be shown that the Wald goodness-of-fit for (3.2.1) is
identical to the minimized weighted sum of squares of the
residuals resulting from the observed values of the summary
measures and their model-predicted values. In addition,
when the data being analyzed are distributed as Poisson or
product multinomial, both the weighted least squares results
and the Wald statistic are equivalent to that which would be
obtained from minimum modified chi-square methods. The only
data situation discussed in this section will be product
multinomial.
The use of WLSA does require a probability model
linking the observed data to some sort of extended
population, in contrast to the randomization strategies of
Chapter II, the distribution assumptions for which were
conditional on aspects of the study design for the observed
data themselves. Use of the product multinomial
distribution is appropriate for data arising from stratified
simple random sampling, and potentially satisfactory for
sampling schemes which are arguably equivalent.
The discussion of the weighted least squares
methodology which follows is based on the presentation of
Grizzle, Starmer, and Koch (1969) as well as Koch et al.
(1985). Computational details are also discussed in Landis,
Stanish, Freeman, and Koch (1976), which is the description
of a Fortran computer program named GENCAT which implements
most of these strategies.
3.2.2 Statistical Theory for Weighted Least Squares
Assume that there exist s subpopulations indexed by
$i = 1, 2, \ldots, s$ from which independent samples of size $n_{i*}$ have
been selected. Let $j = 1, 2, \ldots, r$ index a set of response
profiles determined by one or more response variables which
uniquely classify all of the subjects, and let $y_{ij}$ denote
the frequency of the j-th response profile for the subjects
in the i-th subpopulation. Table 3.1 summarizes the data in
the form of an s x r contingency table.
Table 3.1
General Contingency Table for the Frequencies of Response
Profiles for a Set of Subpopulations

                                Response Profile
Subpopulation  Response Vector  1      2      ...  r      Total
1              y_1'             y_11   y_12   ...  y_1r   n_1*
2              y_2'             y_21   y_22   ...  y_2r   n_2*
...            ...              ...    ...    ...  ...    ...
s              y_s'             y_s1   y_s2   ...  y_sr   n_s*
Total                           y_*1   y_*2   ...  y_*r   n_**
Let $y_i'$ represent the vector of responses
$y_i' = (y_{i1}, y_{i2}, \ldots, y_{ir})$ for the i-th subpopulation. If
$y = (y_1', y_2', \ldots, y_s')'$ denotes the sr x 1 vector of all the
individual frequencies, then $y$ follows the product
multinomial distribution, given the assumed sampling
framework. The likelihood function is expressed as
(3.2.3)    $\Pr\{y\} = \prod_{i=1}^{s} \left\{ n_{i*}! \prod_{j=1}^{r} \frac{\pi_{ij}^{y_{ij}}}{y_{ij}!} \right\}$
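where $\pi_{ij}$ is the probability that a randomly selected
subject from the i-th subpopulation has the j-th response
profile. The $\{\pi_{ij}\}$ satisfy the natural restrictions:
(3.2.4)    $\sum_{j=1}^{r} \pi_{ij} = 1$  for $i = 1, 2, \ldots, s$.
Unbiased estimators for $\{\pi_{ij}\}$ are the sample proportions
$p_{ij} = y_{ij}/n_{i*}$. The covariance matrix of the $\{p_{ij}\}$ includes
the following components:
(3.2.5)    $\mathrm{Cov}\{p_{ij}, p_{i'j'}\} = \begin{cases} \pi_{ij}(1-\pi_{ij})/n_{i*} & i = i',\ j = j' \\ -\pi_{ij}\pi_{ij'}/n_{i*} & i = i',\ j \ne j' \\ 0 & i \ne i' \end{cases}$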
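Vector notation allows one to express this more succinctly.
Let $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ir})'$ and $p_i = (p_{i1}, p_{i2}, \ldots, p_{ir})' = (y_i/n_{i*})$
represent the parameter vector and its sample
estimate for the i-th subpopulation. Let $\pi = (\pi_1', \pi_2', \ldots, \pi_s')'$
denote the compound vector of parameters for all s
subpopulations and similarly, let $p = (p_1', p_2', \ldots, p_s')'$ denote
the sample proportion vector. It follows that
(3.2.6)    $E(p) = \pi$,  $\mathrm{Var}(p) = V(\pi)$
where
$V(\pi) = \begin{bmatrix} V_1(\pi_1) & & 0_{rr} \\ & \ddots & \\ 0_{rr} & & V_s(\pi_s) \end{bmatrix}$
is block diagonal and
(3.2.7)    $V_i(\pi_i) = \frac{1}{n_{i*}} \left[ D_{\pi_i} - \pi_i \pi_i' \right]$
is the covariance matrix for the i-th subpopulation, with
$D_{\pi_i}$ denoting the diagonal matrix with the elements of $\pi_i$ on
its diagonal. A consistent estimator for $V(\pi)$ is $V(p)$, where $p$ has been
substituted for $\pi$.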
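The construction of $V(p)$ can be sketched numerically. The following is a minimal illustration, not the GENCAT implementation; the function name `proportions_and_cov` and the counts are hypothetical, and the block formula is the usual multinomial covariance $(D_p - pp')/n$.

```python
import numpy as np

def proportions_and_cov(table):
    """Return p (length s*r) and block-diagonal V(p) for an s x r table of counts.

    Each diagonal block is (D_{p_i} - p_i p_i') / n_i*; off-diagonal blocks are
    zero because the subpopulation samples are independent.
    """
    table = np.asarray(table, dtype=float)
    s, r = table.shape
    n = table.sum(axis=1)                 # subpopulation sample sizes n_i*
    P = table / n[:, None]                # row i holds p_i'
    V = np.zeros((s * r, s * r))
    for i in range(s):
        block = (np.diag(P[i]) - np.outer(P[i], P[i])) / n[i]
        V[i * r:(i + 1) * r, i * r:(i + 1) * r] = block
    return P.ravel(), V

counts = [[30, 50, 20],   # hypothetical subpopulation 1
          [10, 60, 30]]   # hypothetical subpopulation 2
p, V = proportions_and_cov(counts)
print(np.round(V[:3, :3], 5))
```

Each block annihilates the vector of ones because the proportions in a subpopulation sum to one, so $V(p)$ itself is singular. This is why the functions analysed must be chosen so that their own covariance matrix is non-singular.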
Functional linear models for the vector $\pi$ are
expressed as
(3.2.8)    $F(\pi) = X\beta$
where $F(\pi) = [F_1(\pi), F_2(\pi), \ldots, F_u(\pi)]'$ is a set of $u \le s(r-1)$
functions, $X$ is a known (u x t) design (specification) matrix
with rank $t \le u$, and $\beta$ is a t x 1 vector of unknown parameters.
Each of the functions is required to have continuous partial
derivatives through order two with respect to $\pi$ in an open
region containing $\pi = E(p)$. Another condition that $F$ must
satisfy is that its covariance matrix be non-singular. This
covariance matrix can be written
(3.2.9)    $V_F(\pi) = [H(\pi)][V(\pi)][H(\pi)]'$
where $H(\pi) = [\partial F(z)/\partial z \mid z = \pi]$ is the first derivative matrix of
$F(z)$. A sufficient condition for $V_F(\pi)$ to be non-singular is
$\mathrm{Rank}\{[H(\pi)]', [1_r \otimes I_s]\} = \mathrm{Rank}\{H(\pi)\} + s$.
This states that the functions $F(\pi)$ and the natural
restrictions are linearly independent in an open region
containing $\pi$. Finally, there must be sufficiently large
sample sizes $\{n_{i*}\}$ for the estimators $F(p)$ to have an
approximate multivariate normal distribution.
Given that the above conditions are satisfied, the
model (3.2.8) is asymptotically equivalent to the linear
Taylor series
(3.2.10)    $W[F(p)] + W[H(p)][\pi - p] = 0_{(u-t)}$
for the constraints
(3.2.11)    $W F(\pi) = 0_{(u-t)}$
where $W$ is a known [(u-t) x u] orthocomplement matrix to $X$
and the expansion is taken at the sample estimator $p$. By re-expressing (3.2.11) and
providing for the natural restrictions (3.2.4), a linear
structure for $\pi$ is obtained which provides a basis for
finding an estimator $\hat\pi$ which minimizes the Neyman chi-square
criterion
(3.2.12)    $Q_N = \sum_{i=1}^{s} \sum_{j=1}^{r} n_{i*} (p_{ij} - \hat\pi_{ij})^2 / p_{ij}$
Bhapkar (1966) demonstrated that
(3.2.13)    $Q_N = Q_W(F) = F'W'[W V_F W']^{-1} W F$
where $Q_W(F)$ is a Wald statistic (1943) for the model
corresponding to the linearized constraints (3.2.10). In
turn, a matrix identity lemma in Koch (1969) allows one to
show that $Q_W$ is identically equal to
(3.2.14)    $Q_W(b) = (F - Xb)' V_F^{-1} (F - Xb)$
where
(3.2.15)    $b = (X' V_F^{-1} X)^{-1} X' V_F^{-1} F$
is the weighted least squares estimator for $\beta$ in (3.2.1).
$Q_N$, $Q_W(b)$, and $Q_W(F)$ are all approximately chi-square with
(u-t) degrees of freedom. Thus, $b$ is the minimum modified
chi-square estimator, and as such is a member of the class
of best asymptotically normal (BAN) estimators as demonstrated by
Neyman (1949). These properties of $b$, and the ease with
which $Q_N = Q_W$ can be calculated, are reasons why the weighted
least squares approach to categorical data analysis was
promoted by Grizzle, Starmer, and Koch in their 1969 paper.
When the subpopulation sample sizes become
sufficiently large, $b$ has an approximate multivariate normal distribution
under the model (3.2.8):
(3.2.16)    $b \sim N\left(\beta,\ [X' V_F(\pi)^{-1} X]^{-1}\right)$
If the model (3.2.1) does adequately characterize the data,
as indicated by the goodness-of-fit criterion $Q_W$, then
linear hypotheses of the form $C\beta = 0$, where $C$ is a known c x t
matrix of constants of rank c, can be tested with the Wald
statistic
(3.2.17)    $Q_C = b'C'[C V_b C']^{-1} C b$
where $V_b = \{X' V_F^{-1} X\}^{-1}$ is a consistent estimator of Var{b}. $Q_C$
is approximately chi-square with d.f.=c under $H_0$. $Q_C$ is
identically equal to the difference between the goodness-of-fit
statistic $Q_{W,C}$ for the reduced model $X_C = XZ$, where $Z$ is a
known (t x (t-c)) orthocomplement to $C$, and the Wald
statistic $Q_W$ for the model (3.2.8); that is, $Q_C = Q_{W,C} - Q_W$. This means that $Q_C$
is effectively a test statistic for the additional
constraints implied by $X_C$. Predicted values $\hat F = Xb$ are useful
to calculate in order to facilitate model interpretation;
$V_{\hat F} = X V_b X'$ is a consistent estimator for their covariance
matrix.
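The relationships among these statistics can be checked numerically. The sketch below is illustrative only: the helper names `wls_fit` and `wald` and the values of `F`, `VF`, and `X` are hypothetical, and the orthocomplement $W$ is obtained from a singular value decomposition, which is one convenient choice among many.

```python
import numpy as np

def wls_fit(F, VF, X):
    """Weighted least squares fit of E{F} = X beta; returns (b, Vb, Q_W)."""
    F, VF, X = (np.asarray(a, dtype=float) for a in (F, VF, X))
    VF_inv = np.linalg.inv(VF)
    Vb = np.linalg.inv(X.T @ VF_inv @ X)          # estimated Var{b}
    b = Vb @ X.T @ VF_inv @ F                     # WLS estimator
    resid = F - X @ b
    return b, Vb, float(resid @ VF_inv @ resid)   # weighted residual sum of squares

def wald(b, Vb, C):
    """Wald statistic Q_C = b'C'(C Vb C')^{-1} C b for H0: C beta = 0."""
    C = np.atleast_2d(np.asarray(C, dtype=float))
    Cb = C @ b
    return float(Cb @ np.linalg.solve(C @ Vb @ C.T, Cb))

# Hypothetical function estimates and covariance matrix (u = 4, t = 2)
F = np.array([0.20, 0.30, 0.50, 0.60])
VF = np.diag([0.010, 0.020, 0.015, 0.010])
X = np.array([[1, 0], [1, 0], [1, 1], [1, 1]], dtype=float)

b, Vb, QW = wls_fit(F, VF, X)

# Constraint form: rows of W span the orthogonal complement of the columns of X
U, _, _ = np.linalg.svd(X, full_matrices=True)
W = U[:, X.shape[1]:].T
WF = W @ F
QW_constraints = float(WF @ np.linalg.solve(W @ VF @ W.T, WF))

# Q_C for H0: beta_2 = 0 compared with the change in goodness of fit
QC = wald(b, Vb, [[0, 1]])
_, _, QW_reduced = wls_fit(F, VF, X[:, :1])       # X_C = X Z with Z = (1, 0)'
print(QW, QW_constraints, QC, QW_reduced - QW)
```

The first two printed values agree (the constraint and residual forms of the goodness-of-fit statistic), as do the last two (the Wald statistic for the contrast and the difference in goodness of fit between the reduced and full models).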
In the original Grizzle, Starmer, and Koch paper
(1969), two types of functions were discussed. These
included those for the strictly linear model
(3.2.18)    $F(\pi) = A\pi$
and the log-linear model
(3.2.19)    $F(\pi) = A[\log \pi]$
In the first case, the estimated function is $F(p) = Ap$ and the
covariance structure is $V_F = A V_p A'$. For the log-linear model,
$F(p) = A\log(p)$ and $V_F = A D_p^{-1} V_p D_p^{-1} A'$. While many applications
of WLSA are covered with such functions, other applications
may require more complicated functions. These can usually
be generated as a sequence of linear, logarithmic, and
exponential operations on the vector $p$. Consequently, $V_F(\pi)$
is estimated by $V_F = [H(p)] V_p [H(p)]'$ (see 3.2.9) where $H(p)$ is
a product of the first derivative matrices $\{H_k(p)\}$ where k
indicates the k-th operation in accordance with the chain
rule. These matrices are relatively simple:
(i) for linear functions $Az$, $H = A$
(ii) for log functions $\log z$, $H = D_z^{-1}$
(iii) for exponential functions $\exp z$, $H = D_{\exp z}$
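The chain rule construction can be sketched as follows. This is an illustrative sketch rather than the GENCAT algorithm; the function `compound_function`, its operation labels, and the 2 x 2 counts are hypothetical. The example forms one logit per subpopulation as a log operation followed by a linear operation.

```python
import numpy as np

def compound_function(p, Vp, ops):
    """Apply a sequence of ("linear", A), ("log", None) or ("exp", None)
    operations to p and return (F, V_F) with V_F = H Vp H' by the chain rule."""
    z = np.asarray(p, dtype=float)
    H = np.eye(z.size)
    for kind, A in ops:
        if kind == "linear":
            Hk = np.asarray(A, dtype=float)    # H = A
            z = Hk @ z
        elif kind == "log":
            Hk = np.diag(1.0 / z)              # H = D_z^{-1}, evaluated before taking logs
            z = np.log(z)
        elif kind == "exp":
            z = np.exp(z)
            Hk = np.diag(z)                    # H = D_{exp z}
        else:
            raise ValueError(f"unknown operation: {kind}")
        H = Hk @ H                             # accumulate the Jacobian product
    return z, H @ np.asarray(Vp, dtype=float) @ H.T

# Hypothetical 2 x 2 table: two subpopulations, binary response
counts = np.array([[30.0, 70.0], [45.0, 55.0]])
n = counts.sum(axis=1)
P = counts / n[:, None]
Vp = np.zeros((4, 4))
for i in range(2):
    Vp[2 * i:2 * i + 2, 2 * i:2 * i + 2] = (np.diag(P[i]) - np.outer(P[i], P[i])) / n[i]

A = np.kron(np.eye(2), [[1.0, -1.0]])          # log(p_i1) - log(p_i2) for each i
F, VF = compound_function(P.ravel(), Vp, [("log", None), ("linear", A)])
print(F, np.diag(VF))
```

For a binary response the resulting variance of each logit reduces to the familiar sum of reciprocal cell counts, which provides a simple check on the construction.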
Forthofer and Koch (1973) examined functions of the form
(3.2.20)
for which
(3.2.21 )
and
(3.2.22)
for which
(3.2.23)
Here, $a_1 = A_1 p$, $a_2 = \exp(A_2 \log a_1)$, and $a_3 = A_3 a_2$. Some functions
which take the form (3.2.20) or (3.2.22) include complex
ratio estimates such as rank correlation coefficients, kappa
statistics, and survival rates.
3.2.3 An Example of a Strictly Linear Model
One frequent application of weighted least squares is
for a strictly linear model of the form
(3.2.24)    $F(\pi) = A\pi = X\beta$
Such a model is appropriate when the $A\pi$ represent
subpopulation means or marginal probabilities. $A$ is a known
(u x sr) matrix with full rank $u \le s(r-1)$ and its rows are
oblique to the natural constraints (3.2.4), i.e.
(3.2.25)    $\mathrm{Rank}\{A', [1_r \otimes I_s]\} = \mathrm{Rank}\{A\} + s$
For this model, the estimated function vector is $F = Ap$ and
the corresponding estimated covariance matrix is $V_F = A V_p A'$.
Thus, the WLS estimators for $\beta$ can be written
(3.2.26)    $b = \{X'[A V_p A']^{-1} X\}^{-1} X'[A V_p A']^{-1} A p$
and the goodness-of-fit statistic $Q_W$ is expressible in terms
of $X$, $A$, and $V_p$ as
$Q_W = (Ap - Xb)'[A V_p A']^{-1}(Ap - Xb)$
The following example illustrates how WLSA is applied in a
strictly linear model situation.
Table 3.2 contains a subset of the CHESS data. Those
individuals in the California areas are cross-classified by
area, sex, and the number of colds which they reported in
1973.
Table 3.2
Responses for Those California CHESS Subjects Who Reported
the Presence or Absence of Cold Symptoms in 1973

Area           Sex     0 Colds  1 Cold  2 Colds  3 Colds
California I   Male    377      187     74       18
California I   Female  281      184     95       29
California II  Male    241      133     64       23
California II  Female  173      130     76       25
Total                  1072     634     309      95
Note that this area x sex x colds cross-tabulation contains
no sparse cells; the smallest cell count is 18. With the smallest
subpopulation sample size being $n_{4*} = 404$, the application of
asymptotic theory to the analysis is justified.
In order to proceed with a linear model analysis of
the data, one needs to assume that the subjects in each of
the area x sex subpopulations are conceptually
representative of some corresponding larger subpopulation in
a manner consistent with stratified simple random sampling.
Then, the data of Table 3.2 can be described by the
following likelihood:
(3.2.26)    $\Pr\{y\} = \prod_{i=1}^{4} \left\{ n_{i*}! \prod_{j=1}^{4} \frac{\pi_{ij}^{y_{ij}}}{y_{ij}!} \right\}$
where $\pi_{ij}$ denotes the probability that a randomly selected
subject from the i-th area x sex subpopulation has the j-th
response (j=1,2,3,4). $y_{ij}$ represents the number of subjects
from the i-th subpopulation with the j-th response. Since
the responses are ordinally scaled, ranging in intensity
from low to high, mean scores are an appropriate function to
analyse. Consider the following response function:
(3.2.27)    $F(p) = Ap = \{[0\ 1\ 2\ 3] \otimes I_4\}\,p$
Pre-multiplying $p$ by $A$ creates a function vector containing
estimates for the mean number of colds for each
subpopulation. $F(p)$ is approximately multivariate normal
and has the estimated covariance matrix $V_F = A V_p A'$.
Since there is no a priori model for these functions,
the cell mean model is first fit to gain information on the
significant sources of variation for the mean estimates.
This model can be expressed as
$E\{F(p)\} = A\pi = X\beta = I_4 \beta_I = \beta_I$
In this case, the estimates for $\beta_I$ are $b_I = F$. Since this is
a saturated model, there is no goodness-of-fit statistic
defined. Linear hypotheses were tested to assess the
sources of variation in $\beta_I$. The first hypothesis to be
investigated is whether there are differences among the 4
subpopulations, i.e.
$H_0: \beta_{I1} = \beta_{I2} = \beta_{I3} = \beta_{I4}$
The corresponding $C$ matrix is
$C = \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{bmatrix}$
$Q_C = b'C'(C V_b C')^{-1} C b = 32.39$ (d.f.=3) for this hypothesis
test, and is clearly significant ($\alpha = .01$). This indicates the
need for additional investigation. The two California areas
are compared in terms of their averages over sex via the contrast
matrix
$C = [\ 1\ \ 1\ -1\ -1\ ]$
and found to be significantly different, as the resulting
$Q_C = 8.22$ (d.f.=1). Similarly, the hypothesis of a sex
difference is tested with a $C$ matrix comparing the sexes in
terms of their averages over area:
$C = [\ 1\ -1\ \ 1\ -1\ ]$
$Q_C = 20.96$ (d.f.=1) and is also clearly significant. Sex x area
interaction is investigated with
$C = [\ 1\ -1\ -1\ \ 1\ ]$
This is also a test of the additivity of sex and area.
$Q_C = .08$ and is non-significant. This finding is equivalent to
finding the estimates compatible with the reduced model
$E\{F(p)\} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} = X_R \beta_R$
$\beta_1$ represents the predicted value for California I males,
while $\beta_3$ represents an increment for females and $\beta_2$
represents an increment for California II. The estimated
parameters and their covariance matrix are
$b = [\ .596\ \ .113\ \ .181\ ]'$
$V_b = \begin{bmatrix} .8108 & -.5435 & -.5995 \\ -.5435 & 1.500 & -.0391 \\ -.5995 & -.0391 & 1.433 \end{bmatrix} \times 10^{-3}$
The goodness-of-fit statistic for this model is identical to
the test statistic for the no interaction hypothesis since
that hypothesis is the implied constraint for the model.
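The analysis of Table 3.2 can be retraced with a short script. This is a sketch using NumPy rather than the GENCAT program; the variable names are illustrative. Note that NumPy's `kron` places the identity first when the proportion vector is stacked by subpopulation, whereas the text writes the same matrix as $[0\ 1\ 2\ 3] \otimes I_4$, which is assumed here to reflect a different Kronecker product convention.

```python
import numpy as np

# Table 3.2 counts: rows are CA I male, CA I female, CA II male, CA II female
counts = np.array([[377, 187, 74, 18],
                   [281, 184, 95, 29],
                   [241, 133, 64, 23],
                   [173, 130, 76, 25]], dtype=float)
s, r = counts.shape
n = counts.sum(axis=1)
P = counts / n[:, None]

# Block-diagonal covariance matrix of the compound proportion vector p
Vp = np.zeros((s * r, s * r))
for i in range(s):
    Vp[i * r:(i + 1) * r, i * r:(i + 1) * r] = (np.diag(P[i]) - np.outer(P[i], P[i])) / n[i]

# Mean number of colds in each subpopulation and its covariance matrix
A = np.kron(np.eye(s), [[0.0, 1.0, 2.0, 3.0]])
F = A @ P.ravel()
VF = A @ Vp @ A.T

def wald(b, Vb, C):
    """Wald statistic b'C'(C Vb C')^{-1} C b."""
    C = np.atleast_2d(np.asarray(C, dtype=float))
    Cb = C @ b
    return float(Cb @ np.linalg.solve(C @ Vb @ C.T, Cb))

# Saturated (cell mean) model: b_I = F and V_b = V_F
Q_all = wald(F, VF, [[1, 0, 0, -1], [0, 1, 0, -1], [0, 0, 1, -1]])
Q_area = wald(F, VF, [1, 1, -1, -1])
Q_sex = wald(F, VF, [1, -1, 1, -1])
Q_int = wald(F, VF, [1, -1, -1, 1])

# Reduced model: reference cell, California II increment, female increment
XR = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
VF_inv = np.linalg.inv(VF)
Vb = np.linalg.inv(XR.T @ VF_inv @ XR)
b = Vb @ XR.T @ VF_inv @ F
resid = F - XR @ b
QW = float(resid @ VF_inv @ resid)

print(np.round([Q_all, Q_area, Q_sex, Q_int], 2))
print(np.round(b, 3), round(QW, 2))
print(np.round(Vb * 1e3, 4))
```

The printed statistics can be compared with the values quoted in the text; in particular, the goodness-of-fit statistic for the reduced model coincides with the Wald statistic for the interaction contrast.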
3.2.4 Case Record Data
Often, a subject is classified according to d
categorical variables for which the k-th has the possible
outcomes $j = 1, 2, \ldots, L_k$. Thus, there would be
$r = \prod_{k=1}^{d} L_k$
possible multivariate profiles. When d>3, this number can
become very large. Consequently, the (sr x 1) proportion
vector becomes large and may include many zero frequencies, and
the matrix operations required to produce a
covariance matrix become difficult to perform from a computational
point of view. These potential problems can be circumvented
by operating on the raw data for each sUbpopulation to form
the function of interest, which will generally be
subpopulation means. This is referred to as case record
analysis.
Specifically, let there exist s subpopulations from
which samples of size $\{n_{i*}\}$ have been selected. Let
$l = 1, 2, \ldots, n_{i*}$ index the subjects from the i-th subpopulation.
Then, let $y_{il} = (y_{i1l}, y_{i2l}, \ldots, y_{idl})'$ represent the vector of
responses for the l-th subject. The sample mean for the k-th
response in the i-th subpopulation can be expressed as
(3.2.28)    $\bar y_{ik} = \frac{1}{n_{i*}} \sum_{l=1}^{n_{i*}} y_{ikl} = \frac{1}{n_{i*}}\, a_k' n_i^{(k)}$
where $n_i^{(k)} = (n_{i1}^{(k)}, \ldots, n_{iL_k}^{(k)})'$ denotes the vector of
frequencies $n_{ij}^{(k)}$ for the j-th outcome of the k-th response
in the sample from the i-th subpopulation. Let
$a_k = (a_{k1}, a_{k2}, \ldots, a_{kL_k})'$ represent a set of finite scores
corresponding to the outcomes of the k-th response. The
corresponding vector of proportions is written
$p_i^{(k)} = (n_i^{(k)}/n_{i*})$.
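It follows from the Central Limit Theorem that the
$\bar y_i = (\bar y_{i1}, \bar y_{i2}, \ldots, \bar y_{id})'$ have approximate multivariate normal
distributions when the $n_{i*}$ are sufficiently large ($n_{i*} \ge 20$)
with
(3.2.29)    $\bar y_i \sim N\left(\mu_i,\ \frac{1}{n_{i*}} \Sigma_i\right)$
with $\mu_i$ being the vector of means for the i-th
subpopulation, and the covariance matrix $\Sigma_i$ estimated by
(3.2.30)    $V_i = \frac{1}{n_{i*}} \sum_{l=1}^{n_{i*}} (y_{il} - \bar y_i)(y_{il} - \bar y_i)'$.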
If $\bar y = (\bar y_1', \bar y_2', \ldots, \bar y_s')'$ denotes the vector of sample means
for all the responses for all subpopulations, and
$\mu = (\mu_1', \mu_2', \ldots, \mu_s')'$ denotes the corresponding expected value
vector, then a strictly linear model can be written as
(3.2.31)    $E\{\bar y\} = \mu = X\beta$
where $X$ is a known (ds x t) design matrix with full rank $t \le ds$
and $\beta$ is a (t x 1) vector of unknown parameters. The
covariance matrix of $\bar y$ has the following consistent
estimator:
(3.2.32)    $V_{\bar y} = \begin{bmatrix} \frac{1}{n_{1*}} V_1 & & 0_{dd} \\ & \ddots & \\ 0_{dd} & & \frac{1}{n_{s*}} V_s \end{bmatrix}$
At this point, weighted least squares estimation can be
applied as in section 3.2.2 to generate an estimate $b$ for $\beta$.
Similarly, the Wald statistic $Q_W$ is appropriate to assess
the goodness-of-fit of the model (3.2.31), where $W$ is an
[(sd-t) x sd] orthocomplement to $X$, and has an approximate
chi-square distribution with d.f.=(sd-t). For the same
types of arguments as applied in section 3.2.2, $b$ has an
approximate multivariate normal distribution, so that linear
hypotheses of the form $H_0: C\beta = 0$ can be investigated with the
test statistic $Q_C$. $V_b = \{X' V_{\bar y}^{-1} X\}^{-1}$ is a consistent estimator
of Var{b}. $Q_C$ is distributed as approximately chi-square
with d.f.=c.
The weighted least squares methodology is also
appropriate for the more general functional linear model.
Let $F(\bar y) = [F_1(\bar y), F_2(\bar y), \ldots, F_u(\bar y)]'$ be a set of $u \le ds$ functions
of interest, $X$ be a known (u x t) design matrix of full rank
$t \le u$, and $\beta$ be an unknown (t x 1) vector of parameters. $F(\cdot)$
must satisfy the conditions of having continuous partial
derivatives through order two in an open region containing
$\mu$, and also $V_F$ must be nonsingular, where
$V_F = \left[\frac{\partial F}{\partial z}\Big|_{z=\bar y}\right] V_{\bar y} \left[\frac{\partial F}{\partial z}\Big|_{z=\bar y}\right]'$
Thus, functional linear models which apply to mean
vectors $\{\bar y_i\}$ are conceptually analogous to those which apply
to multinomially distributed counts $\{y_{ij}\}$. There is an
advantage, however, in that the sample size requirements are
less stringent since they pertain to the $\{n_{i*}\}$ instead of
the individual $\{y_{ij}\}$. This is due to the fact that one is
supporting multivariate normality for a mean vector as
opposed to a compound vector of proportions. Also, working
with case record data to calculate $\bar y$ and $V_{\bar y}$ is often more
computationally feasible than working with the corresponding
contingency table.
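A minimal sketch of the case record computations follows. The function name `case_record_summary` and the tiny data arrays are hypothetical and serve only to show the arithmetic; the samples are far smaller than the sizes the asymptotic theory calls for.

```python
import numpy as np

def case_record_summary(groups):
    """Return the stacked mean vector and its block-diagonal covariance estimate.

    `groups` is a list of (n_i x d) arrays of subject-level scores. Each block
    is V_i / n_i*, with V_i computed using the divisor n_i*.
    """
    means, blocks = [], []
    for Y in groups:
        Y = np.asarray(Y, dtype=float)
        n_i = Y.shape[0]
        ybar = Y.mean(axis=0)
        dev = Y - ybar
        V_i = dev.T @ dev / n_i          # within-subpopulation covariance estimate
        means.append(ybar)
        blocks.append(V_i / n_i)         # covariance of the mean vector
    d = blocks[0].shape[0]
    V = np.zeros((d * len(blocks), d * len(blocks)))
    for i, B in enumerate(blocks):
        V[i * d:(i + 1) * d, i * d:(i + 1) * d] = B
    return np.concatenate(means), V

# Hypothetical case records: d = 2 responses, two subpopulations
group1 = [[0, 1], [1, 1], [2, 0], [1, 2]]
group2 = [[0, 0], [1, 0], [3, 2]]
ybar, Vy = case_record_summary([group1, group2])
print(ybar)
print(Vy)
```

Only d means and a d x d covariance block are stored per subpopulation, so the size of the problem does not grow with the number of possible response profiles.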
3.3 Overview of Repeated Measurement Analyses
A repeated measurements study can be broadly defined
as one in which each of the units investigated is measured
under two or more different conditions. Some types of
studies encompassed by repeated measurement designs are
split-plot experiments and change-over design studies.
Another class which includes the CHESS study is longitudinal
studies; for these, the different conditions represent time.
These are especially appropriate when the outcome of
interest may exhibit trend; consequently, many medical and
epidemiological investigations are longitudinal in nature.
While some investigations involve the use of just one
of the multiple responses recorded for a given subject (e.g.
endpoint analysis, see Gould 1980), most take advantage of
the multivariate nature of the data. The type of analysis
which is appropriate for a repeated measurements study
depends on a number of factors. These include whether the
point of the statistical analysis is to address specific
hypotheses or to generate a descriptive linear model. The
structure of the pertinent covariance matrix, the
distributional assumptions which can be made for the
variables under study, and the sample sizes are also relevant.
Potential strategies include profile analysis, univariate
analysis of variance for a summary measure of the response
variables, MANOVA for two or more summary measures, repeated
measurements analysis of variance when the respective
covariance matrix is compound symmetric, and generalized
linear modeling for serial measurements. Koch, Amara,
Stokes, and Gillings (1980) contains an overview of the
customary approaches to repeated measurements data and the
situations in which they apply. The following discussion is
directed at some of these commonly used strategies for
repeated measurements in the continuous data setting.
Table 3.3.1 displays a general data structure for
repeated measurements data.
Table 3.3.1
Data Structure for Repeated Measurements

Group  Subject in Group  Responses Within Group
1      1                 y_111    y_121    ...  y_1d1
1      2                 y_112    y_122    ...  y_1d2
...    ...               ...      ...      ...  ...
1      n_1               y_11n_1  y_12n_1  ...  y_1dn_1
...    ...               ...      ...      ...  ...
s      n_s               y_s1n_s  y_s2n_s  ...  y_sdn_s
Let $y_{il} = (y_{i1l}, y_{i2l}, \ldots, y_{idl})'$ represent the vector of
responses for the l-th subject in the i-th group where
$i = 1, 2, \ldots, s$ and $l = 1, 2, \ldots, n_i$; $n_i$ is the total number of
subjects in the i-th group. If interest lies in comparing
the response profiles among groups, when the measurement is
interval scaled, then standard multivariate analysis of
variance techniques can be applied. Such comparisons might
involve testing for group and/or condition differences, or
for (group x condition) interactions. The usual general
linear model is written
(3.3.1)    $E\{Y\} = X\beta$
where $Y = (y_{11}, y_{12}, \ldots, y_{1n_1}, y_{21}, \ldots, y_{sn_s})'$ is the (n x d)
observation matrix, $y_{il}$ is the vector
of d observations for the l-th subject in the i-th group,
and n is the total number of subjects. $X$ is an n x q
specification matrix for the respective groups, and $\beta$ is a
(q x d) matrix of unknown parameters. The $y_{il}$ are
distributed as independent, multivariate normal with
expected value $\mu_i$ and covariance matrix $\Sigma$, where $\Sigma$ is
positive definite. It is assumed that there are no missing
data.
Under these specifications, the elements of $\beta$
represent response variable means for each of the s groups.
The least squares estimator of $\beta$ is
(3.3.2)    $b = (X'X)^{-1} X'Y$
The questions of interest can be addressed through linear
hypotheses concerning the elements of $\beta$ of the form
(3.3.3)    $C\beta A = 0$
where $C$ is a c x q matrix of constants and $A$ is a (d x s)
matrix of constants. Using the between groups sums of
crossproducts matrix
$Q_H = (CbA)'[C(X'X)^{-1}C']^{-1}(CbA)$
and the within groups sum of crossproducts matrix
$Q_E = A'Y'[I - X(X'X)^{-1}X']YA$
one can construct appropriate test statistics from $Q_H Q_E^{-1}$
such as Wilks' lambda criterion, Roy's largest root, or the
Lawley-Hotelling trace criterion. See Timm (1975), Timm
(1980), or Morrison (1967) for further details concerning
such tests.
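As an illustration, the following sketch computes the least squares estimator and Wilks' lambda for a hypothesis of the form (3.3.3). It assumes the standard textbook forms of the hypothesis and error crossproducts matrices; the function name `manova_wilks` and the data are hypothetical.

```python
import numpy as np

def manova_wilks(Y, X, C, A):
    """Return (b, Wilks' lambda) for H0: C beta A = 0 in the model E{Y} = X beta."""
    Y, X, C, A = (np.asarray(a, dtype=float) for a in (Y, X, C, A))
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y                                   # least squares estimator
    CbA = C @ b @ A
    QH = CbA.T @ np.linalg.solve(C @ XtX_inv @ C.T, CbA)    # hypothesis crossproducts
    resid = Y - X @ b
    QE = A.T @ resid.T @ resid @ A                          # error crossproducts
    lam = np.linalg.det(QE) / np.linalg.det(QH + QE)
    return b, float(lam)

# Hypothetical data: s = 2 groups of 4 subjects, d = 3 conditions
Y = np.array([[1, 2, 3], [2, 2, 4], [1, 3, 3], [2, 3, 5],
              [3, 3, 3], [4, 3, 2], [3, 4, 3], [4, 4, 4]], dtype=float)
X = np.kron(np.eye(2), np.ones((4, 1)))     # group indicator columns
C = np.array([[1.0, -1.0]])                 # contrast between the two groups
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # contrasts among conditions

b, lam = manova_wilks(Y, X, C, A)
print(b)
print(lam)
```

With indicator columns for the groups, the rows of `b` are the group mean vectors, and the chosen `C` and `A` correspond to a test of group x condition interaction. Small values of Wilks' lambda indicate evidence against the hypothesis.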
The question of whether there is a group by condition
interaction is investigated with (3.3.3) with the matrices
(3.3.4)    $C = [\,I_{s-1},\ -1_{s-1}\,]$,  $A = [\,I_{d-1},\ -1_{d-1}\,]'$
Keeping $C$ as expressed in (3.3.4) and writing $A = I_d$ would
result in testing whether there are differences among the
groups; similarly, keeping $A$ as expressed above and writing
$C = I_s$ would result in the test for differences among the
conditions. If the hypothesis of no group x condition
interaction is tenable, then the hypothesis of no group
differences is equivalent to the hypothesis of equal group
means. This can be tested by writing $A$ as $A = 1_d$ and
keeping $C$ as in (3.3.4). If $Y$ were postmultiplied by $A$, one
would be entertaining a univariate analysis of variance on
group means. Similarly, writing $C = [n_1, n_2, \ldots, n_s]$ and
keeping $A$ as in (3.3.4) would yield a test of equal
condition means, given that parallelism was again a valid
assumption.
Another possible choice for A in expression (3.3.4) is
for it to consist of an orthonormal set of all contrasts.
If it can be assumed that the covariance matrix of A'y can
be written
$\mathrm{Var}\{A'y\} = A'VA = \nu I$,
a property known as sphericity, then univariate results can
be obtained. A model incorporating sphericity as an
assumption is the traditional mixed model analysis of
variance. This assumption is equivalent to the assumption
of homogeneous variances and covariances, which is often
reasonable in split-plot experiments where conditions are
randomized within subjects and the observational units are
usually interchangeable. Also, such experiments often
include limited numbers of observations per group such that
multivariate methods are ineffective. Thus, treating the
situation as univariate becomes essential in developing
analyses.
The appropriate model is written
(3.3.5)    $y_{ijl} = \mu_{ij} + s_{i*l} + e_{ijl}$
with the $\{s_{i*l}\}$ representing independent subject effects,
$\{\mu_{ij}\}$ representing fixed effects for the i-th group and j-th
response, and $\{e_{ijl}\}$ denoting independent response errors.
An appropriate F statistic with which to test the hypothesis
of no group x condition interaction can be written $F^* =$
MSGC/MSE (where MSGC indicates the mean square due to group
x condition and MSE the mean square error) with a rejection
region of $F^* > F_{1-\alpha}[(d-1)(s-1), (d-1)(n-s)]$. See Timm
(1975), Morrison (1976) and Koch, Elashoff and Amara (1985)
for details concerning repeated measurements mixed model
ANOVA tables. Also, several authors have discussed
correction factors which can be applied to generate
approximate F-tests when sphericity is not satisfied. See
Greenhouse and Geisser (1959), and Huynh and Feldt (1976).
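The text does not give the form of these correction factors. As a hedged illustration, the sketch below computes the epsilon measure usually associated with Box and with Greenhouse and Geisser, $\epsilon = [\mathrm{tr}(A'SA)]^2 / \{(d-1)\,\mathrm{tr}[(A'SA)^2]\}$, for a covariance matrix $S$ and an orthonormal contrast matrix $A$. The function names and the example covariance matrices are assumptions made for illustration.

```python
import numpy as np

def orthonormal_contrasts(d):
    """Return a d x (d-1) matrix A with A'A = I and A'1 = 0."""
    M = np.column_stack([np.ones(d), np.eye(d)[:, :d - 1]])
    Q, _ = np.linalg.qr(M)       # first column of Q is proportional to the ones vector
    return Q[:, 1:]

def box_epsilon(S):
    """Epsilon measure of departure from sphericity for a d x d covariance matrix."""
    S = np.asarray(S, dtype=float)
    d = S.shape[0]
    A = orthonormal_contrasts(d)
    T = A.T @ S @ A
    return float(np.trace(T) ** 2 / ((d - 1) * np.trace(T @ T)))

d = 4
# Compound symmetry (hypothetical): equal variances and equal covariances
S_cs = 2.0 * (0.5 * np.eye(d) + 0.5 * np.ones((d, d)))
# First-order autoregressive pattern (hypothetical): correlation decays with lag
lags = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
S_ar = 0.8 ** lags

print(box_epsilon(S_cs), box_epsilon(S_ar))
```

Epsilon equals one when sphericity holds and falls toward 1/(d-1) as the departure grows; the approximate F-test multiplies both degrees of freedom by this factor.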
Sphericity is usually an unreasonable assumption in
longitudinal studies. Often, those subject measures closely
spaced will be more highly correlated than those further
apart. However, due to missing data or the need to account
for condition-varying covariates, multivariate analysis of
variance methods may also be unsuitable. A strategy gaining
in use is to consider the problem a univariate regression
situation in which the responses are correlated. The data
situation is that of having n observation vectors $y_i$
(i=1,2,...,n), each with $d_i$ measurements reflecting $d_i$
conditions. One assumes that a linear model can be written
(3.3.6)    $E\{y_i\} = X_i \beta$
where $X_i$ is a design matrix for the i-th individual
reflecting functions of time and covariates. The $y_i$ are
considered to be distributed as multivariate normal.
However, the covariance structure
$\mathrm{Var}\{y_i\} = \Sigma_i$
is not presumed known, as model fitting consists of
estimating both $\beta$ and $\Sigma_i$. Two types of models considered
for this problem are random effects models and
autoregressive models. Ware (1985), Laird and Ware (1982),
and Cook and Ware (1983) contain discussions of such linear
models for longitudinal data.
3.4 Repeated Measurements Analysis for Categorical Data
Linear modeling of repeated measurements data when the
data are categorical in nature lends itself quite readily to
the weighted least squares methodology discussed in section
3.2.2. While decisions concerning the manner in which to
deal with correlated within-subject responses are a major
facet of continuous case analyses, resulting in strategies
such as modeling patterned covariance matrices, WLSA was
specifically intended for modeling correlated functions.
Attention does need to be paid to the selection of
appropriate functions to analyse, since adding response
variables due to time or conditions to any categorical data
setting increases the chances of zero frequencies in many of
the contingency table cells produced by cross-classifying
all subpopulation levels and response variable levels.
The data structure involved can be summarized as
follows. Let the response outcome consist of L possible
categories, and let the response be measured over d
conditions. Thus, there will be $r = L^d$ possible multivariate
response profiles. In Table 3.1, each entry $y_{ij}$ will
represent the frequency of a particular response profile for
the i-th group. These frequencies are better denoted as
$y_{ij}$ where j is a vector subscript $j = (j_1, j_2, \ldots, j_d)$, $j_g = 1, 2, \ldots, L$ for
$g = 1, 2, \ldots, d$. Thus, j will have as its values a particular
profile. Let $\pi_{ij}$ denote the joint probability of response
profile j for subjects from the i-th subpopulation. A
function which allows one to address the usual questions of
interest in a repeated measurements analysis (i.e. group and/or
condition differences, group x condition interaction) is
the first order marginal probabilities. The following
discussion of the analysis of these and other functions is
taken from Koch et al. (1977), and Landis and Koch (1979).
The first order marginal probability is written
$\phi_{igk} = \sum_{j:\, j_g = k} \pi_{ij}$  for $i = 1, 2, \ldots, s$; $g = 1, 2, \ldots, d$; $k = 1, 2, \ldots, L$
which represents the probability of the k-th response
category for the g-th condition in the i-th subpopulation.
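The mapping from joint profile probabilities to first order marginal probabilities is linear, so it can be written as an A matrix. The sketch below builds such a matrix for one subpopulation; the function name `marginal_matrix`, the lexicographic ordering of profiles, and the counts are assumptions made for illustration, and categories are indexed from zero in the code.

```python
import numpy as np
from itertools import product

def marginal_matrix(L, d):
    """Return (profiles, A) where A maps the L**d joint profile probabilities
    to the d*L first order marginal probabilities (row index g*L + k)."""
    profiles = list(product(range(L), repeat=d))
    A = np.zeros((d * L, len(profiles)))
    for col, prof in enumerate(profiles):
        for g, k in enumerate(prof):
            A[g * L + k, col] = 1.0     # profile contributes to category k at condition g
    return profiles, A

# Hypothetical counts for one subpopulation: L = 2 categories, d = 3 conditions
counts = np.array([20, 5, 10, 5, 8, 12, 15, 25], dtype=float)
profiles, A = marginal_matrix(2, 3)
p = counts / counts.sum()
phi = (A @ p).reshape(3, 2)   # row g holds the marginal distribution at condition g
print(phi)
```

Because the marginal probabilities for each condition sum to one, an analysis would typically retain only L-1 rows per condition so that the covariance matrix of the functions is non-singular.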
The following hypotheses can be investigated in terms of the
marginal probabilities:
H1: There are no differences among the marginal
    distributions of the respective attributes
    at each time point for the s subpopulations
H2: There are no differences among the marginal
    distributions of the respective attributes
    over the d time points within each of the
    subpopulations
H3: There is no group x condition interaction with
    respect to the marginal distributions
If H1 is true, then the $\{\phi_{igk}\}$ satisfy the following formal
hypothesis:
$H_{SM}: \phi_{1gk} = \phi_{2gk} = \ldots = \phi_{sgk}$  for g=1,2,...,d and k=1,2,...,L
If there are no differences among the conditions (H2), then
the following hypothesis of first order marginal symmetry
(homogeneity) must hold:
$H_{CM}: \phi_{i1k} = \phi_{i2k} = \ldots = \phi_{idk}$  for i=1,2,...,s and k=1,2,...,L
If H3 holds, then the $\{\phi_{igk}\}$ can be described by an additive
model
$H_{AM}: \phi_{igk} = \mu_k + \xi_{i*k} + \tau_{*gk}$
Here, i=1,2,...,s; g=1,2,...,d; and k=1,2,...,L. $\mu_k$ is an
overall mean associated with the k-th response category,
$\xi_{i*k}$ is an effect due to the i-th subpopulation, and $\tau_{*gk}$ is
an effect due to the g-th condition. It is assumed that the
usual ANOVA constraints are satisfied by the $\{\mu_k\}$, $\{\xi_{i*k}\}$,
and $\{\tau_{*gk}\}$; also the $\phi_{igk}$ sum to 1 over k for each i and g.
If the response categories are ordinally scaled, then
mean scores may be an appropriate function to analyse. Let
$\eta_{ig}$ be a mean score for the responses for the g-th
condition in the i-th subpopulation:
$\eta_{ig} = \sum_{k=1}^{L} a_k \phi_{igk}$  for i=1,2,...,s and g=1,2,...,d
The $\{a_k\}$ represent appropriate scalings. The hypothesis
$H_{SAM}: \eta_{1g} = \eta_{2g} = \ldots = \eta_{sg}$  for g=1,2,...,d
is satisfied when there are no differences among the
subpopulations, and the hypothesis
$H_{CAM}: \eta_{i1} = \eta_{i2} = \ldots = \eta_{id}$  for i=1,2,...,s
is satisfied when there are no differences among the
conditions. $H_{SAM}$ is implied by $H_{SM}$, and similarly, $H_{CAM}$ is
implied by $H_{CM}$. Additionally, a model similar to $H_{AM}$ can be
specified for the hypothesis of no group x condition
interaction.
In situations in which sample sizes are moderate at
best, the mean score may be the most appropriate function to
analyse. When attention is directed at the first order
marginal probabilities $\{\phi_{igk}\}$, then sample size requirements
are that most of the marginal frequencies be greater than 5,
and the subpopulation sizes be greater than 20. Situations
where L is larger often involve ordinally scaled data so
that mean score functions are justifiable.
While both estimates of marginal probabilities and
mean scores can be generated from the application of A
matrices to a vector of proportions p produced from a
contingency table, repeated measurements data frequently
lends itself to the case record strategy described in 3.2.4,
especially for generating mean scores. For the repeated
measurements situation, the components of the function
vector $\bar y_i$ would represent mean scores for each of the k
weighted least squares analysis, utilizing Wald statistics
to investigate hypotheses of interest, is essentially the
same type of regression analysis discussed in section 3.2.2.
3.5 Missing Data Strategies for Categorical Data Analysis
For most longitudinal data studies, there are subjects
with incomplete data. The easiest way to deal with missing
data would be to delete those observations which include
missing values (list-wise deletion), but this often leads to
the loss of a great deal of information if the number of
observations with incomplete data is large. Also, this may
lead to biased results. Timm (1970) and Gleason and Staelin
(1975) discuss strategies for missing data in the general
multivariate linear model setting in which attention is
focused on estimating the covariance matrix. Some
techniques discussed involve estimating the missing data
from regression estimates or principal components estimates
from the complete data. Kleinbaum (1970) proposed
estimating the covariance matrix from pair-wise complete
data. Still others have suggested the derivation of maximum
likelihood estimates on the assumption that the data are
distributed as multivariate normal. Laird and Ware (1982)
discuss the use of the EM algorithm to derive ML estimates
in a longitudinal data setting under the assumption of
random-effects models.
Sections 3.5.1 and 3.5.2 discuss two missing data
strategies which are appropriate in a categorical data
setting. Section 3.5.1 discusses the use of a ratio
estimation procedure which is applicable to either
continuous or categorical data. Section 3.5.2 is concerned
with the strategy of supplemental margins in which those
subjects with incomplete data are treated as a subset of the
entire dataset and used in the estimation of parameters for
which they contain pertinent information.
3.5.1 Ratio Estimation
A missing data strategy which can be applied to either
continuous or categorical data and involves neither the
estimation of missing values nor distributional assumptions
is that of ratio estimation, which is discussed in Stanish,
Gillings, and Koch (1978). This procedure involves the use
of multivariate ratio estimates incorporating the use of
indicator variables which denote the presence or absence of
data for a particular response variable. The basis for the
method is in a paper by Cornfield (1944). This strategy is
applied in a case record analysis setting. The procedure is
asymptotic, and thus requires sufficient sample sizes for
each of the subpopulations under investigation in order that
the ratio estimators be approximately normal and the
estimated covariance matrix consistent. Asymptotic
regression methodology can then be used to describe the
variation among the estimates and address hypotheses of
interest.
Let $y_{il}=(y_{i1l},y_{i2l},\ldots,y_{idl})'$ represent a set of
responses for d outcome measures for the l-th subject in the
i-th subpopulation. The d variables can be either continuous
or categorical in nature. In the repeated measurements
setting, the d variables might denote d times or conditions.
Let the expected value of $y_{il}$ be $\mu_i=(\mu_{i1},\mu_{i2},\ldots,\mu_{id})'$. The
basic problem is to determine an estimate $\hat{\mu}_i$ for $\mu_i$ and to
estimate its covariance matrix in the face of missing values
among the components of the $y_{il}$.
Let $u_{il}=(u_{i1l},u_{i2l},\ldots,u_{idl})'$ be a random vector of
indicator variables, where $u_{ikl}$ has the value 1 if the
response $y_{ikl}$ is observed and 0 otherwise. The ratio
estimator of $\mu_{ik}$ can be expressed as
(3.5.1)
$$R_{ik} = \left[\sum_{l=1}^{n_i} f_{ikl}/n_i\right]\left[\sum_{l=1}^{n_i} u_{ikl}/n_i\right]^{-1} = \exp\{\log(\bar{f}_{ik})-\log(\bar{u}_{ik})\}$$
where $f_{ikl}=y_{ikl}u_{ikl}$. Thus, $f_{ikl}$ takes the response value if
it is observed, and otherwise is set to O. This estimate is
equivalent to what would be obtained if the missing data for
the k-th response variable were replaced with the mean of
the complete data. However, since the construction of the
estimator is based on the sample as a whole, one can take
advantage of its structure to estimate a covariance matrix.
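The construction of the ratio estimator can be sketched numerically. The following is a minimal illustration only: the function name `ratio_estimates`, the use of `NaN` to mark a missing response, and the small data matrix are assumptions made for the example and are not part of the CHESS analysis. It also assumes the observed responses have a positive mean, so that the logarithms are defined.

```python
import numpy as np

def ratio_estimates(y):
    """Ratio estimates R_k = mean(f_k) / mean(u_k) for one subpopulation.

    y is an (n, d) array in which np.nan marks a missing response.
    """
    y = np.asarray(y, dtype=float)
    u = (~np.isnan(y)).astype(float)     # u_kl = 1 if observed, 0 otherwise
    f = np.where(np.isnan(y), 0.0, y)    # f_kl = y_kl * u_kl
    f_bar = f.mean(axis=0)               # means taken over all n subjects
    u_bar = u.mean(axis=0)
    return np.exp(np.log(f_bar) - np.log(u_bar))

# Hypothetical data: 5 subjects, d = 2 responses, two missing values.
y = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0],
              [2.0, 4.0]])
R = ratio_estimates(y)
print(R)
```

For these invented values each ratio estimate coincides with the mean of the observed responses for that variable, which is the equivalence noted in the text.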
Let $g_{il}=(f_{i1l},f_{i2l},\ldots,f_{idl},u_{i1l},u_{i2l},\ldots,u_{idl})'$ and $\bar{g}_i$ be
the sample mean vector of the $g_{il}$'s. Then the ratio
estimator can be written
(3.5.2) $R_i = \exp\{A\log(\bar{g}_i)\}$
with $A=[I_d,-I_d]$. The covariance matrix of $\bar{g}_i$ is written
(3.5.3) $V_{\bar{g}_i}$
Since $R_i$ is a function of $\bar{g}_i$, a consistent estimator for the
covariance matrix of $R_i$ can be written as
(3.5.4) $V_{R_i} = D_{R_i}\,A\,D_{\bar{g}_i}^{-1}\,V_{\bar{g}_i}\,D_{\bar{g}_i}^{-1}\,A'\,D_{R_i}$
as a consequence of applying the operation
(3.5.5) $V_F = H V_{\bar{g}_i} H'$
the appropriate number of times where H is the matrix of
first derivatives of the functions F evaluated at $\bar{g}_i$. This
will yield a positive semi-definite covariance matrix.
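A hedged numerical sketch of this linearization follows. The function name `ratio_covariance` and the test data are hypothetical, and the divisor used for the covariance matrix of the mean vector (n squared) is an assumption, since the text does not show the explicit form of (3.5.3).

```python
import numpy as np

def ratio_covariance(y):
    """Ratio estimates and their linearized covariance matrix.

    y is an (n, d) array with np.nan marking missing responses.
    """
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    u = (~np.isnan(y)).astype(float)
    f = np.where(np.isnan(y), 0.0, y)
    g = np.hstack([f, u])                 # rows are g_l = (f_1l..f_dl, u_1l..u_dl)
    g_bar = g.mean(axis=0)
    dev = g - g_bar
    V_g = dev.T @ dev / n**2              # assumed form of the covariance of g_bar
    A = np.hstack([np.eye(d), -np.eye(d)])
    R = np.exp(A @ np.log(g_bar))
    H = np.diag(R) @ A @ np.diag(1.0 / g_bar)   # first derivatives of exp(A log g)
    return R, H @ V_g @ H.T

# Hypothetical data with two missing values.
y_missing = np.array([[1.0, 2.0],
                      [3.0, np.nan],
                      [2.0, 4.0],
                      [np.nan, 6.0],
                      [2.0, 4.0]])
R, V_R = ratio_covariance(y_missing)
print(R)
print(V_R)
```

When no values are missing, the indicator columns are constant and the result reduces to the usual covariance matrix of the sample means, which offers a simple check on the construction.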
Stanish, Gillings, and Koch (1978) derive the form of the
general term of $V_R$, and illustrate that the covariance
between the k-th and k'-th ratio estimators is
(3.5.6) $\mathrm{Cov}\{R_{ik},R_{ik'}\} = \left[\dfrac{n_{kk'}^2}{n_k n_{k'}}\right] v_{kk'}$
where $n_k$ is the number of observations with data for the k-th
response variable, $n_{k'}$ is the corresponding value for
the k'-th variable, and $n_{kk'}$ is the number of observations
which have values for both variables. $v_{kk'}$ denotes the
covariance between the k-th and k'-th ratio estimators,
based on these $n_{kk'}$ observations. Thus, one can consider
the term $\{n_{kk'}^2/n_k n_{k'}\}$ a missing data correction factor
which is applied to the conditional covariance to generate
the covariance in (3.5.6).
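As a purely hypothetical illustration of the size of this correction factor (the counts below are invented for the arithmetic and are not CHESS values), suppose 100 subjects have data for the k-th variable, 80 for the k'-th variable, and 60 for both:

$$\frac{n_{kk'}^2}{n_k n_{k'}} = \frac{60^2}{100 \times 80} = \frac{3600}{8000} = 0.45$$

With complete data, $n_k = n_{k'} = n_{kk'}$ and the factor equals 1, so that no correction is applied.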
Now, let $\bar{g}=(\bar{g}_1,\bar{g}_2,\ldots,\bar{g}_s)'$ be the vector of means for
the s subpopulations, and similarly, $R=(R_1,R_2,\ldots,R_s)'$ with
expected value $P=(P_1,P_2,\ldots,P_s)'$. Analysis of R can then be
undertaken via asymptotic regression models of the form
(3.5.7)
An assumption of the ratio estimation procedure is that the
missing data occur at random, i.e. whether or not a variable
is observed is not related to the value it would have had if
it had been observed. The missing data indicator vector $u_{il}$
should not be associated with the data vector $y_{il}$. There is
also a limit to the amount of data which can be missing.
Strictly speaking, this should probably be at most ten
percent.
3.5.2 Supplemental Margins
Another procedure which allows one to deal with
missing data in the categorical data setting is referred to
as supplemental margins. This strategy is more appropriate
for frequency data, and is described in Koch, Imrey and
Reinfurt (1972) and Lehnen and Koch (1974). The data in
this procedure are considered to have two components, those
observations which include values for all possible response
variables, and those observations which only have values for
subsets of the response variables. The method is suitable
when the missing data occur at random, due to either
non-response or bad data, or because additional samples have
occurred which did not include measuring all the response
variables. Even then, it must be assumed that whatever
process determined the subset of response variables measured
was independent of the values those observations would have
for all the variables.
The methodology involved is very similar to that in
the weighted least squares analysis of frequency data as
described in section 3.2.2. Let the data vector be written
as $y=(y_1',y_2',\ldots,y_s')'$, where $y_i'=(y_{i1},y_{i2},\ldots,y_{ir_i})$. Here,
$i=1,2,\ldots,s$ indexes the independent subpopulations and
$j=1,2,\ldots,r_i$ denotes the response category in the i-th
subpopulation. Note that the number of response categories
and what they represent is allowed to vary from
subpopulation to subpopulation. Y can be considered to
follow the product multinomial distribution and its
likelihood function is written
(3.5.8) $\Pr\{y\} = \prod_{i=1}^{s} n_{i*}! \prod_{j=1}^{r_i} \dfrac{\pi_{ij}^{y_{ij}}}{y_{ij}!}$
where $\pi_{ij}$ is the probability that a randomly selected
subject from the i-th subpopulation has the j-th response
for that subpopulation. Note the difference from the
likelihood (3.2.3) due to the use of $r_i$ instead of an
r common to all subpopulations.
Let $p_i = (y_i/n_{i*})$ represent the proportions vector for
the i-th subpopulation, and let $p_G$ be a compound vector
defined by $p_G'=(p_1',p_2',\ldots,p_s')$, with
$E\{p_G\}=\pi_G$, where $\pi_G'=(\pi_1',\pi_2',\ldots,\pi_s')$. $\mathrm{Var}\{p_G\}$ will be a
block diagonal matrix with the i-th block given by
$$\frac{1}{n_{i*}}\left[D_{\pi_i} - \pi_i\pi_i'\right]$$
A consistent estimator for $\mathrm{Var}\{p_G\}$ will be the block
diagonal matrix $V(p_G)$, where $p_i$ is substituted for $\pi_i$.
$V(p_G)$ has the same form as V(p) discussed above in a
'complete' multinomial data setting, but has a different
dimension since the diagonal blocks will be of size $r_i \times r_i$
instead of a uniform $r \times r$.
Let $F=[F_1(p_G),F_2(p_G),\ldots,F_u(p_G)]'$ be a set of u
functions of $p_G$. If the previously stated condition of
requiring the elements of F to have partial derivatives with
respect to $p_G$ in a region containing $\pi_G$ is upheld, then the
covariance matrix of F is estimated by
$$V_F = [H(p_G)]\,V(p_G)\,[H(p_G)]'$$
where
$$H(p_G) = \left[\frac{\partial F}{\partial z}\right]_{z=p_G}.$$
A linear model describing the variation among the function
estimates can be expressed as
$$E_A\{F\} = X\beta$$
For many situations, $F(p_G)$ will be linear functions
obtained from $F = Ap_G$; these would be estimates of marginal
probabilities or cell probabilities. An example will serve
to illustrate the use of supplemental margins in an
analysis.
Table 3.5.1
Presence or Absence of Cold Symptoms in 1973, 1974, and 1975
for Female Subjects in Birmingham

Years       Pattern of Response
Included    1973   1974   1975    Frequency
73,74,75     Y      Y      Y         175
             Y      Y      N          80
             Y      N      Y          46
             Y      N      N          46
             N      Y      Y          64
             N      Y      N          64
             N      N      Y          46
             N      N      N          70
73,74        Y      Y                 51
             Y      N                 19
             N      Y                 25
             N      N                 26
73,75        Y             Y          17
             N             Y           6
             Y             N           9
             N             N           8
74,75               Y      Y         233
                    Y      N         127
                    N      Y          87
                    N      N         112
73           Y                       433
             N                       271
74                  Y                363
                    N                235
75                         Y         444
                           N         317
Table 3.5.1 contains the frequencies of responses
corresponding to the possible response profiles for whether
or not a cold was reported during each of the years 1973,
1974, and 1975 for females in Birmingham. The data are
considered to come from seven subpopulations, corresponding
to the possible combinations of years for which the subjects
had data present. The number of possible response profiles
$r_i$ ranges from 2 for the single year subpopulations to 8 for
the complete data subpopulation.
Let the data be represented by the array
$$\begin{aligned}
y_{345}' &= (175,\ 80,\ 46,\ 46,\ 64,\ 64,\ 46,\ 70)\\
y_{34}' &= (51,\ 19,\ 25,\ 26)\\
y_{35}' &= (17,\ 6,\ 9,\ 8)\\
y_{45}' &= (233,\ 127,\ 87,\ 112)\\
y_{3}' &= (433,\ 271)\\
y_{4}' &= (363,\ 235)\\
y_{5}' &= (444,\ 317)
\end{aligned}$$
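Let the proportion vector for the i-th subpopulation be
represented by $p_i = (y_i/n_{i*})$, where i takes the values
{345,34,35,45,3,4,5} for the combination of years 1973, 1974
and 1975. The linear functions which are the marginal
probabilities of having colds for the years 1973, 1974 and
1975 can be generated by
$$F = Ap_G$$
where A is the following 12 x 26 block diagonal transformation
matrix, with one block for each subpopulation and zeros elsewhere:
$$A = \mathrm{diag}(B_3, B_2, B_2, B_2, B_1, B_1, B_1)$$
$$B_3 = \begin{bmatrix} 1&1&1&1&0&0&0&0\\ 1&1&0&0&1&1&0&0\\ 1&0&1&0&1&0&1&0 \end{bmatrix},\quad
B_2 = \begin{bmatrix} 1&1&0&0\\ 1&0&1&0 \end{bmatrix},\quad
B_1 = \begin{bmatrix} 1&0 \end{bmatrix}$$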
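The generation of the marginal functions can be sketched as follows. This is an illustrative reconstruction only: the frequencies are entered in the order printed in Table 3.5.1, the cells within each subpopulation are assumed to be ordered Y before N with the earliest year varying slowest, and the names `counts` and `blocks` are introduced here for the example.

```python
import numpy as np

# Frequencies from Table 3.5.1, subpopulations 345, 34, 35, 45, 3, 4, 5.
counts = [
    [175, 80, 46, 46, 64, 64, 46, 70],
    [51, 19, 25, 26],
    [17, 6, 9, 8],
    [233, 127, 87, 112],
    [433, 271],
    [363, 235],
    [444, 317],
]

# Margin-generating block for a subpopulation with 8, 4, or 2 response profiles.
blocks = {
    8: [[1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 1, 1, 0, 0], [1, 0, 1, 0, 1, 0, 1, 0]],
    4: [[1, 1, 0, 0], [1, 0, 1, 0]],
    2: [[1, 0]],
}

# Compound proportion vector p_G (length 26).
p_G = np.concatenate([np.array(c, dtype=float) / sum(c) for c in counts])

# Block diagonal A (12 x 26).
A = np.zeros((12, 26))
row = col = 0
for c in counts:
    B = np.array(blocks[len(c)], dtype=float)
    A[row:row + B.shape[0], col:col + B.shape[1]] = B
    row += B.shape[0]
    col += B.shape[1]

F = A @ p_G      # the 12 estimated marginal probabilities of a cold
print(np.round(F, 3))
```

The first three elements of F are the 1973, 1974, and 1975 margins for the subjects with complete data; the remaining elements are the margins available in each partially observed subpopulation.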
It follows that the functions can be described with the
following model:
$$E_A\{F\} = X\pi$$
where $\pi = (\pi_{73},\pi_{74},\pi_{75})'$ is the parameter vector where $\pi_k$
denotes the probability of having a cold in the k-th
year (k=1973,1974,1975).
$$E\{F\} = X\pi = \begin{bmatrix}
1&0&0\\ 0&1&0\\ 0&0&1\\ 1&0&0\\ 0&1&0\\ 1&0&0\\ 0&0&1\\ 0&1&0\\ 0&0&1\\ 1&0&0\\ 0&1&0\\ 0&0&1
\end{bmatrix}
\begin{bmatrix} \pi_{73}\\ \pi_{74}\\ \pi_{75} \end{bmatrix}$$
The estimate of the covariance matrix of F is given by
$V_F=A[V(p_G)]A'$. The goodness of fit for this model is QW =
6.25, with d.f.=9, thus indicating an adequate fit. This is
support for the assumption that the same parameters
$\pi_{73},\pi_{74},\pi_{75}$ represent the probability of reporting a cold in
1973, 1974, and 1975 respectively for each of the
subpopulations in which they apply. The hypothesis that
these probabilities are the same for each of the years
was investigated with a test of the form $C\pi=0$, where
$$C = \begin{bmatrix} 1 & -1 & 0\\ 1 & 0 & -1 \end{bmatrix}$$
The resulting test statistic QC=17.24 (d.f.=2) is
significant, α=.05, and thus this hypothesis is rejected.
Table 3.5.2 contains the estimated probabilities $\pi_{73}$, $\pi_{74}$ and
$\pi_{75}$, as well as their standard errors. For comparative
purposes, these estimates are also presented for analogous
analyses when only the subjects from the complete data
subpopulation were included, and also for the analysis for
the data coming from the 1 and 2 year subpopulations only.
Table 3.5.2
Estimated Probabilities and Standard Errors

                              Subjects With    Subjects from One
Parameter    All Subjects     Complete Data    or Two Years
π73          .599 (.013)      .587 (.020)      .523 (.023)
π74          .635 (.011)      .648 (.020)      .515 (.024)
π75          .574 (.011)      .560 (.020)      .469 (.019)
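The weighted least squares computations behind this example can be sketched as below. This is a hedged reconstruction rather than the program used for the analysis: the cell ordering within each subpopulation is assumed (Y before N, earliest year varying slowest), the counts are entered as printed in Table 3.5.1, and the variable names are introduced for the example. Because of these assumptions the sketch may not reproduce the reported statistics exactly.

```python
import numpy as np

counts = [
    [175, 80, 46, 46, 64, 64, 46, 70],   # years 73,74,75
    [51, 19, 25, 26],                     # 73,74
    [17, 6, 9, 8],                        # 73,75
    [233, 127, 87, 112],                  # 74,75
    [433, 271], [363, 235], [444, 317],   # 73 only, 74 only, 75 only
]
blocks = {
    8: [[1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 1, 1, 0, 0], [1, 0, 1, 0, 1, 0, 1, 0]],
    4: [[1, 1, 0, 0], [1, 0, 1, 0]],
    2: [[1, 0]],
}
# Design matrix: columns are pi_73, pi_74, pi_75; one row per marginal function.
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 0, 0], [0, 1, 0],
              [1, 0, 0], [0, 0, 1],
              [0, 1, 0], [0, 0, 1],
              [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

F_parts, V_parts = [], []
for c in counts:
    c = np.array(c, dtype=float)
    n = c.sum()
    p = c / n
    V_p = (np.diag(p) - np.outer(p, p)) / n    # multinomial covariance of p_i
    B = np.array(blocks[len(c)], dtype=float)
    F_parts.append(B @ p)
    V_parts.append(B @ V_p @ B.T)

F = np.concatenate(F_parts)
V_F = np.zeros((12, 12))                       # block diagonal A V(p_G) A'
pos = 0
for V in V_parts:
    k = V.shape[0]
    V_F[pos:pos + k, pos:pos + k] = V
    pos += k

W = np.linalg.inv(V_F)
cov_b = np.linalg.inv(X.T @ W @ X)             # covariance of the WLS estimates
b = cov_b @ X.T @ W @ F                        # estimates of pi_73, pi_74, pi_75
resid = F - X @ b
Q_W = float(resid @ W @ resid)                 # goodness of fit, d.f. = 12 - 3
C = np.array([[1, -1, 0], [1, 0, -1]], dtype=float)
Cb = C @ b
Q_C = float(Cb @ np.linalg.inv(C @ cov_b @ C.T) @ Cb)   # equality of the three years
print(np.round(b, 3), np.round(np.sqrt(np.diag(cov_b)), 3), Q_W, Q_C)
```

The same code restricted to the first block, or to the last six blocks, gives the kind of comparison shown in the second and third columns of Table 3.5.2.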
3.6 Summary
Weighted least squares methodology is appropriate for
a variety of analysis situations for categorical data. The
function vector of interest can take a broad range of forms,
depending on the hypotheses to be addressed, the data
structures, and the strength of the sample sizes. Linear
models fit to the function vector can provide a description
of the variation among the function estimates as well as
providing a framework in which to test the hypotheses of
interest. Section 3.4 discussed the specific data situation
of repeated measurements, and the use of linear models for
categorical data to analyse such datasets. The chapter
ended with a discussion of two strategies for dealing with
incomplete data in the categorical data setting.
The CHESS dataset provides an opportunity to
illustrate the application of weighted least squares
methodology as described in this chapter to a large, complex
dataset. Chapter V will be concerned with the analysis of
the complete data. Linear model descriptions of several
functions of interest will be pursued for a
crossclassification of estimates identified by a variable
selection scheme described in Chapter IV. Chapter VI will
address the analysis of the incomplete data via the
application of multivariate ratio estimation and
supplemental margins techniques.
CHAPTER IV
ANALYSIS OF COMPLETE DATA: VARIABLE SELECTION
4.1 Introduction
The approach to the analysis of the complete data,
consisting of those 4002 individuals with responses for each
of the nine time points, will consist of two stages as
discussed in the beginning of Chapter II. The first
involves the use of randomization test statistics to
determine the extent of association in the data between the
response variables and evaluation variables, including both
the variable indicating the pollution level and also those
of a demographic nature. Once the major sources of
variation for the response variables are determined,
estimates corresponding to the cross-classification of those
variables can be produced and the variation among those
estimates modeled in the context of weighted least squares.
One can thus generate a linear model description of the
variation, including the estimation of model parameters.
Clearly, variable selection procedures can be useful in the
first phase of such an analysis.
Randomization test statistics including extensions of
the Mantel-Haenszel methodology can be useful tools in
choosing among a set of explanatory variables those which
maximally explain the variation in a set of response
variables. Sections 4.2 and 4.3 describe an existing
strategy for variable selection in the case of categorical
data analysis and also an extension to the situation where
the response is multivariate. Section 4.4 will be concerned
with the application of this strategy to the complete data
to determine the explanatory variables for the subsequent
linear models analyses which will be discussed in Chapter V.
4.2 Variable Selection for Categorical Data
The statistician is often faced with a large number of
explanatory variables in a dataset with which to explain the
variation in the outcomes for the response measure or
measures. Especially when one is dealing with a very large
dataset and/or a great many variables, it may be reasonable
to limit the modeling phase of the analysis to a subset of
the variables, for both monetary and computational
considerations. This can be very important in categorical
data analysis, since weighted least squares analysis
generally involves modeling the estimates produced by a
cross-classification of the explanatory variables. If these
become too great in number, many of the cross-classification
cells will have inadequate sample sizes for the analysis to
proceed. Other modeling procedures commonly used for
categorical data rely on iterative algorithms to produce
maximum likelihood estimates. These can become
prohibitively expensive if a large number of variables are
involved.
Higgins and Koch (1977) describe a procedure for
variable selection with categorical data which has also been
described in Clarke and Koch (1976). The procedure is
somewhat similar to forward stepwise regression as used with
continuous data. Higgins and Koch applied their method to
an environmental dataset in which the response measure of
interest was dichotomous--the presence or absence of
byssinosis in cotton textile workers. The explanatory
variables included measures pertaining to work conditions
and demographics. The purpose of their procedure is to find
a subset of variables responsible for the most variation in
the response outcomes among the subpopulations defined by
the cross-classification of such variables.
First, Pearson chi-square statistics are computed for
the first order association between the response variable
and each of the explanatory variables. The statistic, Qp'
is divided by its degrees of freedom, and the first variable
selected is that with the largest chi-square per degree of
freedom. The process continues with the calculation of Qp
for the cross-classification of the response variable with
all two-way combinations of the first variable selected and
all the remaining variables. Qp/d.f. is determined for each
of these contingency tables, and again, the variable with
the highest chi-square per degree of freedom is eligible for
selection.
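One step of this screening can be sketched in code. The sketch below is an assumption-laden illustration: the function names `pearson_per_df` and `next_variable`, the record layout (one dictionary per subject), and the invented data are not from Higgins and Koch (1977), and the termination statistics are omitted.

```python
from collections import Counter
from itertools import product

def pearson_per_df(rows, response, factors):
    """Pearson chi-square per d.f. for the response crossed with the
    cross-classification of the listed factors."""
    cells = Counter((tuple(r[f] for f in factors), r[response]) for r in rows)
    row_tot, col_tot = Counter(), Counter()
    for (group, outcome), count in cells.items():
        row_tot[group] += count
        col_tot[outcome] += count
    n = sum(row_tot.values())
    q = 0.0
    for group, outcome in product(row_tot, col_tot):
        expected = row_tot[group] * col_tot[outcome] / n
        q += (cells.get((group, outcome), 0) - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return (q / df if df > 0 else 0.0), q, df

def next_variable(rows, response, selected, candidates):
    """Candidate with the largest chi-square per d.f. when combined with
    the variables already selected."""
    scores = {v: pearson_per_df(rows, response, selected + [v])[0]
              for v in candidates}
    return max(scores, key=scores.get), scores

# Hypothetical data: x1 is strongly related to y, x2 is unrelated.
rows = []
for x1, x2, y, count in [(0, 0, 0, 40), (0, 0, 1, 10), (0, 1, 0, 40), (0, 1, 1, 10),
                         (1, 0, 0, 10), (1, 0, 1, 40), (1, 1, 0, 10), (1, 1, 1, 40)]:
    rows += [{"x1": x1, "x2": x2, "y": y}] * count

best, scores = next_variable(rows, "y", [], ["x1", "x2"])
print(best, scores)
```

A second call with `selected=["x1"]` would evaluate the remaining candidates in combination with the first variable chosen, mirroring the second step described above.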
The authors discuss two possible termination
statistics. Termination statistic "a" is the sum of the
Pearson chi-square statistics for the association of the
response variable and eligible explanatory variables at each
realization of the possible combinations of the levels of
the variables already selected. Termination statistic "b"
is the randomization statistic QAR, where the explanatory
variable is the eligible variable, and the q strata refer to
the possible combinations of the variables already selected.
The advantage of the termination statistic "a" is said to be
that it assesses total association; its weaknesses are that
as the selection process goes on, its degrees of freedom
become so great that it becomes extremely stringent and the
data become so sparse that its validity as a chi-square
statistic becomes suspect. Termination statistic "b", as
the extended Mantel-Haenszel statistic, detects average
partial association and will have less stringent sample size
requirements.
Whatever termination statistic is employed, the
criterion for failure to include is a previously-decided
significance level, usually α=.05 or α=.10. At that point,
one is considered to have a reasonable set of variables for
further analysis since none of the remaining variables have
a significant effect once the selected variables have been
taken into account. While other reasonable subsets of
variables may also exist, the motivation for the
Higgins-Koch algorithm is to find one which maximizes the variation
among the response variable outcome levels; thus it can be
considered of particular interest. It should also be noted
that, since the variable selected at each stage is the one
which in combination with other selected variables maximizes
the total variation among the corresponding outcome variable
levels relative to their degrees of freedom, the selection
criterion can be considered analogous to maximizing a mean
square due to regression in regression analysis.
4.3 Variable Selection Extended to Multivariate Response
Profiles
One implied and implemented extension of the
Higgins-Koch strategy is to datasets in which the response variable
is not dichotomous, but has more than two outcome levels.
Another extension which follows logically is to the
situation where one is interested in more than one dependent
variable at one time. An example of such situations is
repeated measurement designs in which one is also involved
in assessing variation of a response over time. In the
CHESS datasets, there is interest in determining a subset of
explanatory variables with which to form a
cross-classification of estimates for the multivariate profile
corresponding to whether or not individuals had 2 or more
colds (2 + colds) in 1973, 1974, or 1975.
Questions of this kind can be addressed with the
multivariate randomization statistics discussed in Chapter
II. Variable selection can proceed along the same lines as
in section 4.2, with the screening process for explanatory
variables intended to find those with the maximum variation
among a set of multivariate profiles, rather than among the
levels of a single outcome variable. The strategy employed
in this chapter is to first calculate QAR for association
of the explanatory variables with the vector of summary
measures calculated from the response profiles. Then, QAR
is divided by its degrees of freedom, and the variable which
is significant at α=.05 and also has the largest chi-square
per degree of freedom is selected.
The procedure continues with construction of strata
for the levels of the variables already selected. The
statistic QAR for the average partial association of the
remaining explanatory variables on the dependent variable
summary measures across the strata is calculated for each
one. The variable with the largest chi-square per degree of
freedom which is also significant at the chosen α level is
selected as the next variable. Note that this process is
effectively the same as calculating all possible termination
statistics "b" in the Higgins and Koch paper (1977) and
bypassing the phase where one calculates a chi-square
statistic for an overall table where the subpopulation
levels are combinations of selected variables.
While the use of an average partial association
statistic may have the usual limitations, its use over the
total association statistic QTR seems defensible on the
following grounds. QTR would lose its effectiveness early
since its degrees of freedom would be q(s-1)u and would
become so great so quickly that QTR would be excessively
stringent as a screening device. Also, one would run into
sample size problems, since QTR is linked to
individual stratum sizes instead of total strata sizes as is
QAR. Since the use of multivariate profiles tends to thin
data out quickly, sample size considerations are critical
for multivariate analyses.
The process continues with the construction of QAR for
the average partial association of the unselected variables
over strata formed from the combination of the levels of the
selected variables until one has a subset of variables which
maximize the variation among the multivariate profile
summary measures simultaneously, and can proceed with the
linear modeling of such variation.
4.4 Application of Multivariate Variable Selection to CHESS
Data for 2+ Colds and 1+ Asthma in 1973,1974, and 1975
As described in Chapter I, the health status
information collected was whether or not each child was
experiencing cold or asthma symptoms at each of the nine
time points of the CHESS study. Since the data were
separated into three yearly components, 1973, 1974, and
1975, summary measures were created for preliminary
descriptive purposes which included the total number of
colds or asthma or both, for each year. So, there are a
number of potential response measures to analyse, and the
choice necessarily depends on the purpose of the analysis as
well as the subjects it covers.
Table 4.4.1 displays the distribution of responses by
each year and overall for both colds and asthma.
Table 4.4.1
Frequency of Colds and Asthma in 1973, 1974, and 1975

                        Colds                          Asthma
Frequency   Overall  1973  1974  1975     Overall  1973  1974  1975
    0          688   1936  1788  1858       3634   3801  3802  3747
    1          875   1247  1271  1242        142    102   101    84
    2          752    633   687   639         54     40    34    58
    3          603    186   256   263         37     59    65   113
    4          443                            26
    5          272                            23
    6          192                            15
    7          112                            16
    8           48                            15
    9           17                            40
Four summary measures were considered reasonable to
construct: one indicating whether subjects had 2 or more
colds in a given year, one indicating whether subjects had
one or more asthma events in a given year, one indicating
whether subjects had 0-1,2-3, or 4+ total colds and one
indicating whether subjects had 0 or 1 total asthma events.
Also constructed were the mean number of colds, and mean
number of asthma events for each year. Thus, there are six
reasonable response variables to analyse, three each for
asthma and colds. Those relating to total colds or total
asthma events over the three years obviously rule out the
possibility of assessing time as a source of variation, but
their consideration does provide a point of comparison for
the advantages of the multivariate analyses which do.
The primary focus of the analysis efforts for the
complete data is to assess the variation among those who had
2+ colds (and separately, 1+ asthma) in the three years
represented by the complete data. The variables available
in this dataset are as follows: SEX, AREA, RACE2 (white
versus other), RACE3 (white versus black versus other), AGE2
(younger versus older), AGE4 (four age groupings), and SPOLL
(low, medium, or high pollution levels). There is also
interest in whether there was any variation over time. The
first stage involves using the variable selection scheme
presented in preceding sections of this chapter to determine
the appropriate set of explanatory variables. The second
stage involves the description of the variation with a
linear model. One of the objectives of this dissertation is
to describe a systematic process by which to accomplish both
of these stages, while being as efficient as possible with
respect to the size of the dataset.
The question of whether there is variation in the
distribution of 2+ colds (1+ asthma) across the three years
can first be addressed within the context of multivariate
randomization tests by creating summary measures which
reflect the differences in the proportions of the outcome
between the years. If there is a difference, then one has to
incorporate time into subsequent analyses, either through
the use of multivariate response measures in the
randomization tests used as part of variable selection
strategies or time parameters in subsequent weighted least
squares modeling. Table 4.4.2 presents the results of
multivariate randomization tests for the association between
the evaluation variables and two measures created from the
multivariate profiles: the first is the difference between
the proportion of those having 2+ colds (1+ asthma) in 1973
and 1974, and the second is the difference for 1974 and
1975. These summary measures were constructed via an A
matrix of the form
$$A = \begin{bmatrix} 0&-1&0&-1&1&0&1&0\\ 0&0&-1&-1&1&1&0&0 \end{bmatrix}$$
QAR is the multivariate mean score statistic of interest,
with d(s-1) degrees of freedom, where d=2 for the two
response measures created. $C=[I_{s-1},-1]$, where s is the
number of levels of the evaluation variable.

Table 4.4.2
Multivariate Randomization Tests for the Association of the
Explanatory Variables with the Difference in 2+ Colds and
1+ Asthma Between 1973-1974, and 1974-1975

2+ Colds
Variable     QAR     d.f.   P-Value
AREA        53.81     10    0.0000
SEX          2.31      2    0.3146
RACE2        3.53      2    0.1716
RACE3       12.15      4    0.0162
AGE2         5.36      2    0.0685
AGE4        14.53      6    0.0242
SPOLL        7.62      4    0.1066

1+ Asthma
Variable     QAR     d.f.   P-Value
AREA         8.81     10    .5499
SEX          7.61      2    .0222
RACE2        1.72      2    .4242
RACE3        2.40      4    .6626
AGE2         1.10      2    .5763
AGE4         1.24      6    .9750
SPOLL        3.25      4    .5166
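The way an A matrix turns the eight response profile proportions into summary measures can be sketched as follows. The profile ordering (1973 indicator varying fastest), the contrast matrix `D`, and the profile counts are assumptions for the example; the signs and the pairing of years in a difference matrix depend on how the contrasts are defined.

```python
import numpy as np
from itertools import product

# Profiles (c73, c74, c75), 1 = outcome present; 1973 varies fastest (assumed).
profiles = [(c73, c74, c75) for c75, c74, c73 in product([0, 1], repeat=3)]

# Rows give the proportions with the outcome in 1973, 1974, and 1975.
A_prop = np.array(profiles, dtype=float).T

# Contrasts for the 1973-1974 and 1974-1975 differences (one possible choice).
D = np.array([[1, -1, 0], [0, 1, -1]], dtype=float)
A_diff = D @ A_prop

# Hypothetical profile counts for one subpopulation.
counts = np.array([30, 10, 12, 8, 9, 6, 5, 20], dtype=float)
p = counts / counts.sum()

props = A_prop @ p     # three yearly proportions
diffs = A_diff @ p     # two year-to-year differences
print(props, diffs)
```

Each row of a difference matrix sums to zero, so the resulting measures are unaffected by the proportion of subjects with no events in any year.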
Four of the seven tests indicate that a time x
evaluation variable interaction exists at the α=.05 level of
significance for 2+ colds, while only sex has a significant
time interaction for the 1+ asthma outcome measures
(p=.0222). However, one can conclude from these tests that
both 2+ colds and 1+ asthma interact with time for one or
more of the explanatory variables, and hence, that time
should be incorporated into the future analysis structure.
The variable selection thus proceeded by focusing on the
three proportions of 2+ colds (and 1+ asthma) for 1973,
1974, and 1975 simultaneously.
The same analyses were repeated, using
$$A = \begin{bmatrix} 0&1&0&1&0&1&0&1\\ 0&0&1&1&0&0&1&1\\ 0&0&0&0&1&1&1&1 \end{bmatrix}$$
to create summary measures which are the proportions of 2+
colds in 1973, 1974, and 1975. Table 4.4.3 shows the
relative ranking of the explanatory variables for QAR
divided by its degrees of freedom. The table also includes
the same information for the dependent variable 1+ asthma.
SEX emerges as the first variable to include for both 2+
colds and 1+ asthma, as its chi-square per degree of freedom
is over five times as great as that of the next contending
variable for colds and over twice as much for asthma. QAR
is significant for both colds and asthma at the α=.05 level of
significance. It should be noted that all of the tests are
significant (α=.05) for colds except for the pollution
index, whereas SEX is the only significant variable
according to the same criterion for 1+ asthma.

Table 4.4.3
Relative Ranking of Explanatory Variables with Respect
To Their Association with 2+ Colds and 1+ Asthma in 1973,
1974, and 1975 as Determined by QAR/d.f.

2+ Colds
Variable     QAR      d.f.   P-Value   QAR/d.f.
SEX         126.06      3    0.000      42.02
RACE2        23.35      3    0.000       7.78
AREA        104.86     15    0.000       6.99
RACE3        33.22      6    0.000       5.54
AGE2          9.25      3    0.026       3.08
AGE4         21.17      9    0.012       2.35
SPOLL         8.62      6    0.196       1.44

1+ Asthma
Variable     QAR      d.f.   P-Value   QAR/d.f.
SEX          14.07      3    .0028       4.69
AREA         22.16     15    .1037       1.48
SPOLL         8.00      6    .2381       1.33
RACE2         2.82      3    .4198        .94
AGE2          1.92      3    .5894        .64
RACE3         3.61      6    .7288        .60
AGE4          2.85      9    .9699        .32
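The ranking criterion is simple enough to verify directly. The short sketch below recomputes QAR/d.f. from the QAR values and degrees of freedom reported for 2+ colds in Table 4.4.3; the dictionary layout and variable names are introduced only for the illustration.

```python
# (QAR, d.f.) for 2+ colds, as reported in Table 4.4.3.
colds = {
    "SEX": (126.06, 3),
    "RACE2": (23.35, 3),
    "AREA": (104.86, 15),
    "RACE3": (33.22, 6),
    "AGE2": (9.25, 3),
    "AGE4": (21.17, 9),
    "SPOLL": (8.62, 6),
}

# Chi-square per degree of freedom, then rank from largest to smallest.
per_df = {name: q / df for name, (q, df) in colds.items()}
ranked = sorted(per_df, key=per_df.get, reverse=True)
ratio_first_to_second = per_df[ranked[0]] / per_df[ranked[1]]

for name in ranked:
    print(f"{name:6s} {per_df[name]:6.2f}")
```

The ratio of the first to the second criterion is what supports the statement that SEX exceeds the next contending variable by more than a factor of five for colds.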
The next step of the analysis was to assess the
strength of the association between the explanatory
variables and the response measures while controlling for
SEX. Strata were formed corresponding to males and females
and the average partial association statistics QAR computed.
Table 4.4.4 displays the relative ranking of the explanatory
variables in this second phase of screening according to
the same criterion QAR/d.f., where the degrees of freedom is
still going to be (s-1)d. The table includes both 2+ colds
and 1+ asthma, since both dependent variables had SEX enter
as the first variable selected in the screening process.
Table 4.4.4
Relative Ranking of Explanatory Variables with Respect to
Their Association with 2+ Colds and 1+ Asthma in 1973,
1974, and 1975 as Determined by QAR/d.f. Controlling for Sex

2+ Colds
Variable     QAR      d.f.   P-Value   QAR/d.f.
RACE2        23.23      3    0.000       7.74
AREA        100.08     15    0.000       6.67
RACE3        33.36      6    0.000       5.56
AGE2          7.75      3    0.052       2.58
AGE4         18.46      9    0.030       2.05
SPOLL         8.30      6    0.217       1.38

1+ Asthma
Variable     QAR      d.f.   P-Value   QAR/d.f.
AREA         21.23     15    .1295       1.42
SPOLL         7.72      6    .2592       1.29
RACE2         2.73      3    .4355        .91
AGE2          1.86      3    .6011        .62
RACE3         3.54      6    .7384        .59
AGE4          2.81      9    .9714        .31
RACE2 becomes the second variable selected for the
yearly outcomes concerning 2+ colds with a QAR/d.f. of
7.743, compared to the next highest criterion of 6.672 for
AREA. The three level classifier for race, RACE3, had the
third highest criterion of 5.560. All of these variables
also had significant QAR at α=.05. The pollution index
variable again showed no association with a QAR of 8.30 and
6 d.f. (p=.2173). None of the explanatory variables for 1+
asthma are significant, although AREA has the highest
chi-square per degree of freedom criterion of 1.42. Its p-value
for QAR is .1295, which may be considered borderline
significant for an α=.10 level and an analysis which has
screening as its primary objective rather than strict
hypothesis-testing.
Consequently, strata corresponding to a SEX x RACE2
crosstabulation were constructed, and QAR calculated for the
partial association of the remaining explanatory variables
with the proportion of 2+ colds in 1973,1974, and 1975,
adjusting for SEX and RACE2. The results of these
computations are shown in Table 4.4.5. AREA emerges as the
next variable to enter, with a chi-square per degree of
freedom of 6.50. The next largest value was 2.51 for AGE2.
No further work was done for 1+ asthma.
At this point, three variables have been selected by
the variable selection process for their association with 2+
colds --SEX, AREA, and RACE2, and two variables have been
selected for 1+ asthma--SEX and AREA. The selection process
could be carried forward one more time for 2+ colds to
assess whether the remaining explanatory variables were
associated with 2+ colds after controlling for the first
three variables selected. However, it is doubtful whether
these data could support stratification into more than 24
subpopulations for a linear models analysis. As a general
rule, the selection procedure for categorical data will
usually involve stopping after three variables have been
selected for this reason as well as for computational
considerations involving the subsequent analyses. Thus,
although the next step was completed and did not result in
any further significant associations, it well may have been
left out regardless.
Note that the pollution index variable SPOLL did not
survive the variable selection process. No first order
association between SPOLL and either 1+ asthma or 2+ colds
was found, and no association was found when other
explanatory variables were adjusted for in subsequent
analyses. Since the index is ordinally scaled, a more
appropriate test statistic would be the correlation
statistic discussed in Chapter II for assessing the
association with SPOLL as well as the other ordinally scaled
variables AGE2 and AGE4. This analysis was performed, but
the results did not differ. This is not really
surprising, given that the quality of the pollution
data on which the index was based was debatable, as
discussed in Chapter I.
Table 4.4.5
Relative Ranking of Explanatory Variables with Respect
To Their Association with 2+ Colds in 1973, 1974, and 1975
as Determined by QAR/d.f. Controlling for Sex and Race2

                      2+ Colds
Variable      QAR    d.f.   P-Value   QAR/d.f.
AREA         97.44    15     0.000      6.50
AGE2          7.52     3     0.057      2.51
AGE4         18.38     9     0.031      2.04
SPOLL        11.27     6     0.080      1.88
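The ranking criterion used in Table 4.4.5 can be sketched as follows. This is only an illustration: the helper `chi2_sf` and the variable names are assumptions introduced here, not part of the original analysis, and the upper-tail probability is computed from the standard closed-form series for integer degrees of freedom using the Python standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df > x)."""
    h = x / 2.0
    if df % 2 == 0:
        # Even d.f.: finite Poisson-type sum exp(-h) * sum_{j<df/2} h^j / j!
        total, term = 0.0, 1.0
        for j in range(df // 2):
            total += term
            term *= h / (j + 1)
        return math.exp(-h) * total
    # Odd d.f.: start from erfc(sqrt(h)) and add half-integer gamma terms.
    total = math.erfc(math.sqrt(h))
    for j in range((df - 1) // 2):
        total += math.exp(-h) * h ** (j + 0.5) / math.gamma(j + 1.5)
    return total

# (variable, QAR, d.f.) as printed in Table 4.4.5
candidates = [("AREA", 97.44, 15), ("AGE2", 7.52, 3),
              ("AGE4", 18.38, 9), ("SPOLL", 11.27, 6)]

# Rank by chi-square per degree of freedom, largest first.
ranked = sorted(candidates, key=lambda row: row[1] / row[2], reverse=True)
for name, q, df in ranked:
    print(f"{name:6s} Q/d.f. = {q / df:5.2f}   p = {chi2_sf(q, df):.4f}")
```

The variable with the largest ratio is the next candidate to enter, which here is AREA.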
CHAPTER V
LINEAR MODELS ANALYSIS OF COMPLETE DATA
This chapter is concerned with the application of
weighted least squares methodology to the analysis of the
health status measures in the CHESS study for the children
with complete data. Variables selected as a consequence of
the randomization analyses of Chapter IV are the basis of
the subpopulations under study. Functions examined are the
proportion of children reporting one or more incidents of
asthma in a given year, the proportion of children reporting
two or more colds in a given year, the proportion of
children ever reporting one or more incidents of asthma per
year, and mean colds in a given year. Sections 5.1 and 5.2
discuss different alternatives for data adjustments to be
made when existing zero frequencies would induce covariance
matrix singularities. Additionally, Section 5.2 includes a
discussion of the use of residual analysis in model
selection for categorical data analysis.
5.1 Linear Model Analysis of 1+ Asthma Data
The variable selection in the previous chapter for the
proportion of 1+ asthma in 1973, 1974, and 1975 identified
the variables SEX and AREA as accounting for most of the
variation among the estimates of interest. Table 5.1.1
contains the cross-classification of those subjects having
Table 5.1.1
1+ Asthma Reported in 1973, 1974, and 1975

                                                                                    Marginals
                  None   1973   1974  73,74   1975  73,75  74,75  73-75  Total   1973   1974   1975
Male
Charlotte          185      3      0      1      3      1      5      1    199      6      7     10
Birmingham         461     12      3      1     17      1      6     15    516     29     25     39
NYC                 76      1      1      0      0      0      0      0     78      1      1      0
Utah               319      4      5      2     11      0      3      5    349     11     15     19
California I       543      7      4      2     16      5      9     26    522     40     41     56
California II      249      3      7      1      9      1      4     14    288     19     26     28
Female
Charlotte          229      4      3      0      3      1      0      8    248     13     11     12
Birmingham         543     11      5      1     14      2      4      9    589     23     19     29
NYC                 73      1      0      0      2      0      0      2     78      3      2      4
Utah               353      6      2      1      7      1      2     14    386     22     19     24
California I       418      7      6      3      6      4      2     11    457     25     22     23
California II      275      3      2      1      2      0      4      5    292      9     12     11
1+ asthma by sex, area and year. Each of the columns
represents the number of individuals who reported asthma in
that period of time. 'None' means that no occurrences were
reported at any of the nine time points, '1974' entries are
those who reported an incident in 1974 only, '74,75'
entries represent subjects who reported incidents in both
1974 and 1975, while the column for '73-75' includes the
numbers of subjects who reported asthma in each of the three
years. The table also includes marginal frequencies for
each of the three years. As might be expected, the 'None'
column dominates the table, containing approximately ninety
percent of the data.
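The relation between the eight response profiles and the yearly marginal frequencies can be illustrated with a short sketch. The bit-pattern encoding and the names `indicator` and `charlotte_males` are assumptions made here for illustration; the counts are the Charlotte male row of Table 5.1.1.

```python
# Profiles are ordered None, 1973, 1974, (73,74), 1975, (73,75), (74,75), (73-75).
# With that ordering, bit t of the profile index j says whether year t was reported.
YEARS = (1973, 1974, 1975)
indicator = [[(j >> t) & 1 for j in range(8)] for t in range(3)]

# Charlotte males, profile counts as printed in Table 5.1.1
charlotte_males = [185, 3, 0, 1, 3, 1, 5, 1]

# A yearly marginal is the sum of the counts of every profile containing that year.
marginals = [sum(flag * y for flag, y in zip(row, charlotte_males))
             for row in indicator]

for year, m in zip(YEARS, marginals):
    print(year, m)
```

The three sums reproduce the marginal columns printed for that row of the table.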
The subjects in the sex x area strata are considered
to be conceptually representative of some larger
subpopulation in a sense equivalent to simple random
sampling. Their response profiles are assumed to be
independent, so that the data of Table 5.1.1 have the
product multinomial distribution, i.e.

$$\Pr\{y\} = \prod_{h=1}^{2} \prod_{i=1}^{6} n_{hi\cdot}! \prod_{j=1}^{8} \frac{\pi_{hij}^{y_{hij}}}{y_{hij}!}$$

where $\pi_{hij}$ denotes the probability that a randomly selected
subject from the h-th sex and i-th area has the j-th
profile; $y_{hij}$ denotes the frequency of the h-th sex, i-th
area and j-th response profile in the sample of size $n_{hi\cdot}$.
Also, h=1,2 for male and female respectively, i=1,2,...,6 for
the six areas, and j=1,2,...,8 for the eight response profiles
depicted in Table 5.1.1. The functions of interest are the
first order marginal probabilities of reporting asthma in
1973, 1974, or 1975 for each combination of the levels of
SEX and AREA. Estimators for these linear functions can be
expressed in matrix notation as

$$F = F(p) = Ap = \left( \begin{bmatrix} 0&1&0&1&0&1&0&1 \\ 0&0&1&1&0&0&1&1 \\ 0&0&0&0&1&1&1&1 \end{bmatrix} \otimes I_{12} \right) p$$

where $p = (y_{hij}/n_{hi\cdot})$ is a (96 x 1) vector of sample
proportions. Note that the data are relatively sparse due
to the domination of the 'None' response profile; this is
particularly true for the two NYC subpopulations. Since
forming A*p results in zero marginal probabilities for NYC
males, .5 was added to the 1975 cell so that the
computational strategy would be feasible (as suggested in
Koch et al., 1977). However, it should be pointed out that
the marginal frequencies for NYC are less than the 5 usually
considered a reasonable requirement in order to assume that
the proportion vector p has an approximate multivariate
normal distribution. The other 33 marginal frequencies are
all greater than 5. A consistent estimator of the
covariance matrix of F is given by $V_F = A V_p A'$. VF has a
block diagonal structure which takes into account the
correlation among the functions produced from each
subpopulation.
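For a single subpopulation, the function estimates and their covariance block can be sketched as below. This is a minimal illustration under the multinomial assumption stated above, using the Birmingham male row of Table 5.1.1; the variable names are hypothetical, and the full VF would stack twelve such 3 x 3 blocks along the diagonal.

```python
# Birmingham males, profile counts from Table 5.1.1
counts = [461, 12, 3, 1, 17, 1, 6, 15]
n = sum(counts)
p = [y / n for y in counts]

# Per-subpopulation block of A: rows pick the profiles containing 1973, 1974, 1975.
A1 = [[0, 1, 0, 1, 0, 1, 0, 1],
      [0, 0, 1, 1, 0, 0, 1, 1],
      [0, 0, 0, 0, 1, 1, 1, 1]]

# Multinomial covariance of the sample proportions: (diag(p) - p p') / n
Vp = [[((p[j] if j == k else 0.0) - p[j] * p[k]) / n for k in range(8)]
      for j in range(8)]

# F = A1 p and VF = A1 Vp A1'
F = [sum(a * pj for a, pj in zip(row, p)) for row in A1]
AV = [[sum(A1[r][j] * Vp[j][k] for j in range(8)) for k in range(8)]
      for r in range(3)]
VF = [[sum(AV[r][k] * A1[s][k] for k in range(8)) for s in range(3)]
      for r in range(3)]

print([round(f, 4) for f in F])
print([[round(v, 7) for v in row] for row in VF])
```

The off-diagonal elements of this block carry the within-subject correlation between years; functions from different subpopulations are uncorrelated, which is what gives VF its block diagonal form.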
A useful procedure for investigating the variation
among these estimates is to use the cell mean model to gain
a preliminary assessment. This model is formally stated as
$$E\{F(p)\} = A\pi = X\beta = I\beta = \beta$$
The resulting parameter estimates b of β are thus the
function estimates, and the corresponding estimated
covariance matrix is Vb=VF. Since X has rank t=36, there is no
reduction in dimension and thus no lack of fit defined.
However, Wald statistics can be employed to investigate
linear hypotheses concerning β. Model-fitting efforts can
then continue with the model structures implied by the
results of such hypothesis-testing. The linear hypotheses
of interest and their corresponding Wald statistics
$Q_C = b'C'(C V_b C')^{-1} C b$ are displayed in Table 5.1.2.
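The Wald statistic for a linear hypothesis Cβ=0 can be sketched in a few lines. The numbers below are hypothetical toy values chosen only to make the arithmetic easy to follow, not estimates from this study; `solve` and `wald` are helper names introduced here.

```python
def solve(M, v):
    """Solve M x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def wald(b, Vb, C):
    """Q_C = (Cb)' (C Vb C')^{-1} (Cb)."""
    t = len(b)
    Cb = [sum(c * x for c, x in zip(row, b)) for row in C]
    CV = [[sum(row[k] * Vb[k][j] for k in range(t)) for j in range(t)]
          for row in C]
    CVC = [[sum(cv[j] * row2[j] for j in range(t)) for row2 in C] for cv in CV]
    z = solve(CVC, Cb)
    return sum(x * y for x, y in zip(Cb, z))

# Hypothetical estimates with independent, equal variances (toy values)
b = [0.10, 0.04, 0.07]
Vb = [[0.0004, 0.0, 0.0], [0.0, 0.0004, 0.0], [0.0, 0.0, 0.0004]]

# Two equivalent contrast bases for the hypothesis beta1 = beta2 = beta3
C_adjacent = [[1, -1, 0], [0, 1, -1]]
C_last = [[1, 0, -1], [0, 1, -1]]

q_adjacent = wald(b, Vb, C_adjacent)
q_last = wald(b, Vb, C_last)
print(q_adjacent, q_last)
```

Both contrast matrices span the same hypothesis, so they give the same statistic; the number of rows of C is the degrees of freedom of the test.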
Table 5.1.2
Hypotheses and Resulting Test Statistics Concerning
Proportions of 1+ Asthma Reported in 1973, 1974, and 1975

Hypothesis                                           QC     D.F.   P-Value
1. No difference between sexes for                  1.97      1     .160
   averages over (area x time)
2. No variation among areas for averages           15.89      5     .007
   over (sex x time)
3. No variation among times for averages           14.02      2     .000
   over (sex x area)
4. Homogeneity across areas of differences         17.27      5     .004
   between sexes for averages across time
5. Homogeneity across times of differences          4.26      2     .119
   between sexes for averages across area
6. Homogeneity across areas for differences         8.57     10     .573
   across time for averages across sex
7. No time x sex x area interaction                 6.78     10     .746

Clearly, the three-way interaction is non-significant
(α=.05), and so is the (area x time) interaction. However,
the (sex x area) interaction is significant, and the (sex x
time) interaction is suggestive. A model which incorporates
these results is X2, which includes separate intercepts for
each of the six areas, effects for sex within each area,
and separate time effects for each sex. Formally, this
model is stated as
$$E\{F(p)\} = F(\pi) = X_2 \beta_2$$
where X2 is the (36 x 16) design matrix displayed in Table
5.1.3a. Table 5.1.3b contains the parameter estimates b2
and their standard errors. The Wald goodness-of-fit
statistic for the model X2 is
$Q_W = (F - X_2 b_2)' V_F^{-1} (F - X_2 b_2)$ = 18.42 (d.f.=20),
and its non-significance (p=.56) supports the
model. Table 5.1.3b also contains Wald statistics
corresponding to linear hypotheses concerning β. β1 is the
predicted reference value for males in Charlotte in 1973.
β2-β6 are the corresponding values for Birmingham, NYC,
Utah, California I and California II. β7 represents an
incremental effect for females in Charlotte, and β8-β12 are
similar effects for the other areas. β13 is the incremental
effect for the year 1974 for males, β14 is the
corresponding effect for 1975, and β15 and β16 are the time
effects for the years 1974 and 1975 for the females,
respectively.
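The structure of the design matrix X2 follows directly from these parameter definitions and can be generated rather than typed. The sketch below assumes rows ordered by sex, then area, then year, as in Table 5.1.3a; the function name `x2_row` and the zero-based coding are choices made here for illustration.

```python
AREAS = ["Charlotte", "Birmingham", "NYC", "Utah", "California I", "California II"]

def x2_row(sex, area, year):
    """One row of X2. sex: 0 male, 1 female; area: 0..5; year: 0, 1, 2 for 1973-75."""
    row = [0] * 16
    row[area] = 1                      # columns 1-6: area reference value
    if sex == 1:
        row[6 + area] = 1              # columns 7-12: female increment within area
    if year > 0:
        # columns 13-14: 1974 and 1975 for males; columns 15-16: same for females
        row[12 + 2 * sex + (year - 1)] = 1
    return row

X2 = [x2_row(s, a, t) for s in (0, 1) for a in range(6) for t in range(3)]

n_functions, n_parameters = len(X2), len(X2[0])
print(n_functions, n_parameters, n_functions - n_parameters)
```

With 36 functions and 16 parameters, the goodness-of-fit statistic has 36 - 16 = 20 degrees of freedom, provided X2 has full column rank.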
Thus, the significant (sex x area) interaction of the
previous preliminary modeling stage is accounted for in the
model X2 with separate sex effects for each area. The test
for no (sex x time) interaction in the model X2
(β13=β15, β14=β16) resulted in a Wald statistic QC=8.69
128
Table 5.1. 3a
Specification Matrix for Reduced Model Xz for 1+ Asthma
Specification Matrix X2
1 0 0 0 0 0 0 0 0 0 0 0 o 0 0 01 0 0 0 0 0 0 0 0 0 0 0 1 0 0 01 0 0 0 0 0 0 0 0 0 0 0 o 1 0 00 1 0 0 0 0 0 0 0 0 0 0 o 0 0 00 1 0 0 0 0 0 0 0 0 0 0 1 0 0 00 1 0 0 0 0 0 0 0 0 0 0 o 1 0 00 0 1 0 0 0 0 0 0 0 0 0 o 0 0 00 0 1 0 0 0 0 0 0 0 0 0 1 0 0 00 0 1 0 0 0 0 0 0 0 0 0 o 1 0 00 0 0 1 0 0 0 0 0 0 0 0 o 0 0 00 0 0 1 0 0 0 0 0 0 0 0 1 0 0 00 0 0 1 0 0 0 0 0 0 0 0 o 1 0 00 0 0 0 1 0 0 0 0 0 0 0 o 0 0 00 0 0 0 1 0 0 0 0 0 0 0 1 0 0 00 0 0 0 1 0 0 0 0 0 0 0 o 1 0 00 0 0 0 0 1 0 0 0 0 0 0 o 0 0 00 0 0 0 0 1 0 0 0 0 0 0 1 0 0 00 0 0 0 0 1 0 0 0 0 0 0 o 1 0 01 0 0 0 0 0 1 0 0 0 0 0 o 0 0 01 0 0 0 0 0 1 0 0 0 0 0 o 0 1 0 e1 0 0 0 0 0 1 0 0 0 0 0 o 0 0 10 1 0 0 0 0 0 1 0 0 0 0 o 0 0 00 1 0 0 0 0 0 1 0 0 0 0 o 0 1 00 1 0 0 0 0 0 1 0 0 0 0 o 0 0 10 0 1 0 0 0 0 0 1 0 0 0 o 0 0 00 0 1 0 0 0 0 0 1 0 0 0 o 0 1 00 0 1 0 0 0 0 0 1 0 0 0 o 0 0 10 0 0 1 0 0 0 0 0 1 0 0 o 0 0 00 0 0 1 0 0 0 0 0 1 0 0 o 0 1 00 0 0 1 0 0 0 0 0 1 0 0 o 0 0 10 0 0 0 1 0 0 0 0 0 1 0 o 0 0 00 0 0 0 1 0 0 0 0 0 1 0 o 0 1 00 0 0 0 1 0 0 0 0 0 1 0 o 0 0 10 0 0 0 0 1 0 0 0 0 0 1 o 0 0 00 0 0 0 0 1 0 0 0 0 0 1 o 0 1 00 0 0 0 0 1 0 0 0 0 0 1 o 0 0 1
Table 5.1.3b
Estimated Parameters, Standard Errors and Test
Statistics for Reduced Model X2 for 1+ Asthma

Parameter  Interpretation                                           Estimate ± S.E.
β1    Predicted value for 1973 males for Charlotte              .0304 ± .0102
β2    Predicted value for 1973 males for Birmingham             .0506 ± .0090
β3    Predicted value for 1973 males for NYC                   -.0203 ± .0072
β4    Predicted value for 1973 males for Utah                   .0334 ± .0085
β5    Predicted value for 1973 males for California I           .0769 ± .0109
β6    Predicted value for 1973 males for California II          .0722 ± .0140
β7    Incremental value for 1973 females for Charlotte          .0191 ± .0160
β8    Incremental value for 1973 females for Birmingham        -.0106 ± .0112
β9    Incremental value for 1973 females for NYC                .0341 ± .0196
β10   Incremental value for 1973 females for Utah               .0233 ± .0136
β11   Incremental value for 1973 females for California I      -.0247 ± .0142
β12   Incremental value for 1973 females for California II     -.0386 ± .0168
β13   Incremental value for 1974 males                          .0036 ± .0046
β14   Incremental value for 1975 males                          .0216 ± .0054
β15   Incremental value for 1974 females                        .0058 ± .0039
β16   Incremental value for 1975 females                        .0018 ± .0043

Wald Test Statistics For Hypotheses                                QC     d.f.
No area variation for sex increments                             17.19      5
   (β7 = β8 = β9 = β10 = β11 = β12)
No time effect for males for 1974 (β13=0)                          .62      1
No time effect for males for 1975 (β14=0)                        16.33      1
No time effect for females for 1974 (β15=0)                       2.17      1
No time effect for females for 1975 (β16=0)                        .18      1
No (sex x time) interaction                                      26.86      4

Goodness of Fit QW = 18.42   d.f. = 20
with d.f.=2; this is significant at α=.05. Hypotheses
concerning the individual time effects were then tested;
the effect for males in 1975 proved to be the only
significant one (α=.05). The hypothesis of equality of (area
x sex) effects (β7=β8=β9=β10=β11=β12) was also assessed,
and proved to be contradicted (QC=17.19, with 5 d.f.).
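For a hypothesis that a single parameter is zero, the Wald statistic reduces to the squared ratio of the estimate to its standard error. The sketch below applies this to the four time increments as printed in Table 5.1.3b; small differences from the tabled QC values are to be expected because the printed estimates and standard errors are rounded. The dictionary names are hypothetical.

```python
# (estimate, standard error) as printed in Table 5.1.3b
time_effects = {
    "beta13 (males 1974)":   (0.0036, 0.0046),
    "beta14 (males 1975)":   (0.0216, 0.0054),
    "beta15 (females 1974)": (0.0058, 0.0039),
    "beta16 (females 1975)": (0.0018, 0.0043),
}
tabled_qc = [0.62, 16.33, 2.17, 0.18]

CRITICAL_1DF_05 = 3.84  # upper 5% point of chi-square with 1 d.f.

# Single-d.f. Wald statistic: (b / se)^2
q = [(b / se) ** 2 for b, se in time_effects.values()]
significant = [value > CRITICAL_1DF_05 for value in q]

for (name, _), value, flag in zip(time_effects.items(), q, significant):
    print(f"{name:24s} Q = {value:6.2f}  significant at .05: {flag}")
```

Only the 1975 increment for males exceeds the critical value, in agreement with the conclusion stated in the text.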
In view of the significant (sex x area) and (sex x
time) interactions, it is of interest to describe their
nature. This will be done with subsequent models. These
models should be viewed as descriptive tools, inferentially
justified by the model X2 . P-values resulting from tests on
model parameters are used as guidelines for further
smoothing rather than as outcomes in formal hypothesis
testing. Accordingly, the model X3 was fitted, which
included a reduction of the 16 parameters to 13, excluding
the 1974 time increments and the 1975 increment for females.
The goodness-of-fit criterion for X3 is Qw = 23.82
(d.f.=23,p=.41), which is clearly supportive of the model.
Table 5.1.4 displays the parameter estimates for this model
as well as the resulting Wald statistics for linear
hypotheses concerning them. X3 has the same structure as X2
except that the columns corresponding to β13, β15, and β16
have been removed. The Wald statistics suggest that a final
model can further reduce dimensionality. The reference
parameters for the California areas are not significantly
different from each other, and neither are the sex effects
for NYC and Utah nor the sex effects for the California
Table 5.1.4
Estimated Parameters and Test Statistics for
the Reduced Model X3 for 1+ Asthma

Parameter  Interpretation                                              Estimate ± S.E.
β1    Predicted value for males in 1973 & 1974 in Charlotte        .0320 ± .0100
β2    Predicted value for males in 1973 & 1974 in Birmingham       .0521 ± .0086
β3    Predicted value for males in 1973 & 1974 in NYC              .0001 ± .0066
β4    Predicted value for males in 1973 & 1974 in Utah             .0348 ± .0083
β5    Predicted value for males in 1973 & 1974 in California I     .0181 ± .0106
β6    Predicted value for males in 1973 & 1974 in California II    .0135 ± .0139
β7    Increment for females in 1973 & 1974 in Charlotte            .0155 ± .0156
β8    Increment for females in 1973 & 1974 in Birmingham          -.0150 ± .0101
β9    Increment for females in 1973 & 1974 in NYC                  .0266 ± .0191
β10   Increment for females in 1973 & 1974 in Utah                 .0200 ± .0133
β11   Increment for females in 1973 & 1974 in California I        -.0281 ± .0131
β12   Increment for females in 1973 & 1974 in California II       -.0395 ± .0166
β13   Increment for males for 1975                                 .0192 ± .0044

Wald Test Statistics For Simplifications

Equality of reference parameters for California I and
California II, equality of sex effects for NYC and Utah
(β5 = β6, β9 = β10, β11 = β12)      QC = 1.91   d.f. = 3

       0  0  0  0  1 -1  0  0  0  0  0  0  0
C1 =   0  0  0  0  0  0  0  0  1 -1  0  0  0
       0  0  0  0  0  0  0  0  0  0  1 -1  0

Null sex effects for Charlotte, Birmingham, NYC and Utah
(β7 = β8 = β9 = β10 = 0)      QC = 7.07   d.f. = 4

       0  0  0  0  0  0  1  0  0  0  0  0  0
C2 =   0  0  0  0  0  0  0  1  0  0  0  0  0
       0  0  0  0  0  0  0  0  1  0  0  0  0
       0  0  0  0  0  0  0  0  0  1  0  0  0
areas (α=.05). The tests supporting these conclusions were
simultaneously assessed with contrast matrix C1 shown in
Table 5.1.4. An additional hypothesis investigated was
whether the sex effects for Charlotte, Birmingham, NYC and
Utah were simultaneously null. The QC for this test was
7.07 (d.f.=4), and thus these sources of variation can be
eliminated.
A final model X4 was then fitted which took into
account the results of the statistical tests concerning the
parameters of intermediate model X3. The parameters for X4
are seven in number, and include predicted reference
parameters for subjects in Charlotte, Birmingham, NYC and
Utah in 1973 and 1974, and another for males in California
in 1973 and 1974. There are two additional parameters. One
is an incremental effect for females in California and the
other is an overall incremental effect for 1975 for males.
The goodness-of-fit statistic for X4 is 32.71 with 29 d.f.
Its non-significance (α=.25) supports model X4 as adequately
describing the functions F(p).
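The goodness-of-fit p-values quoted for the successive asthma models can be checked approximately with the Wilson-Hilferty normal approximation to the chi-square distribution. This is a sketch using an approximation, not the exact computation used in the original analysis; `chi2_sf_wh` and the `fits` dictionary are names introduced here, and the statistics are those reported for X2, X3, and X4.

```python
import math

def chi2_sf_wh(x, df):
    """Wilson-Hilferty approximation to P(chi-square with df d.f. > x)."""
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    # Upper tail of the standard normal via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# model: (goodness-of-fit Q_W, residual d.f.) as reported in Section 5.1
fits = {"X2": (18.42, 20), "X3": (23.82, 23), "X4": (32.71, 29)}

p_values = {model: chi2_sf_wh(q, df) for model, (q, df) in fits.items()}
for model, p in p_values.items():
    q, df = fits[model]
    print(f"{model}: Q_W = {q:5.2f}  d.f. = {df}  approx p = {p:.3f}")
```

Each reduction adds to the residual degrees of freedom (20, 23, 29) while the fit statistic remains non-significant, which is the pattern the model sequence relies on.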
Thus, the variation among the subpopulation estimates
for 1+ asthma is mostly attributable to differences in
areas. The (sex x area) interaction surfacing in the
results of Table 5.1.2 is isolated to California with this
model. The (sex x time) interaction is accounted for with
an effect for 1975 which is limited to males. Thus, the
effect of time in the estimates for 1+ asthma is very
limited. The estimates for β4, the design matrix X4, and
the values of the original estimates for 1+ asthma as well
as their predicted values X4b4 are displayed in Tables
5.1.5a and 5.1.5b.

Table 5.1.5a
Estimated Parameters and Standard Errors
for Final Model X4 for 1+ Asthma

Specification Matrix X4

1 0 0 0 0 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0 1
0 1 0 0 0 0 0
0 1 0 0 0 0 0
0 1 0 0 0 0 1
0 0 1 0 0 0 0
0 0 1 0 0 0 0
0 0 1 0 0 0 1
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 1
0 0 0 0 1 0 0
0 0 0 0 1 0 0
0 0 0 0 1 0 1
0 0 0 0 1 0 0
0 0 0 0 1 0 0
0 0 0 0 1 0 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0 0
0 1 0 0 0 0 0
0 1 0 0 0 0 0
0 1 0 0 0 0 0
0 0 1 0 0 0 0
0 0 1 0 0 0 0
0 0 1 0 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0

Parameter  Interpretation                             Estimate ± S.E.
β1    Predicted value for Charlotte               .0384 ± .0077
β2    Predicted value for Birmingham              .0430 ± .0051
β3    Predicted value for NYC                     .0034 ± .0062
β4    Predicted value for Utah                    .0426 ± .0065
β5    Predicted value for California              .0769 ± .0085
β6    Increment for females for California       -.0343 ± .0105
β7    Increment for 1975 for males                .0182 ± .0043

Goodness of fit: QW = 32.71   d.f. = 29   (p-value = .2894)

Table 5.1.5b
Observed and Predicted Estimates of First Order
Marginal Probabilities for 1+ Asthma
5.2 Linear Model Analysis of 2+ Colds for 1973, 1974,
and 1975 for the Sex x Race x Area Cross-Classification
Variables for sex, area, and race (SEX, AREA, RACE)
were selected for inclusion in the modeling for 2+ colds in
1973, 1974, and 1975 in the variable selection process
described in Chapter IV. Table 5.2.1 displays the results
of cross-classifying the subjects by sex, race, and area
according to their response profiles. The table also
contains the marginal frequencies for having 2+ colds in
1973, 1974, and 1975. The data are distributed somewhat
more uniformly across the response profiles than they were
for 1+ asthma; however, the 'None' category still dominates
the distribution. Also, the addition of race to the
cross-classification means that the number of subpopulations under
investigation doubles to 24 from the 12 studied in section
5.1. It is doubtful whether the inclusion of still another
cross-classification variable could have been addressed
within this analysis framework.
Similar to the previous analysis, the functions being
modeled are the first order marginal probabilities of
reporting colds in 1973, 1974, and 1975 for each
subpopulation corresponding to all possible combinations of
sex, race and area. The marginal frequencies which are the
numerators of such functions are displayed in Table 5.2.1
Table 5.2.1
Frequency of 2+ Colds Reported in 1973, 1974, and 1975

                                                                                    Marginals
                  None   1973   1974  73,74   1975  73,75  74,75  73-75  Total   1973   1974   1975
White Males
Charlotte           71     15     19      9     10      2      7      3    136     29     38     22
Birmingham         189     14     35     15     22      8     14      6    303     43     70     50
NYC                 47      4      1      2      5      2      2      0     63      8      5      9
Utah               171     25     37      7     50      9     12      9    320     50     65     80
California I       298     31     39      6     35     13     19     10    451     60     74     77
California II      150     23     21      9     27      5     13     12    260     49     55     57
Other Males
Charlotte           41      7      8      4      2      0      1      0     63     11     13      3
Birmingham         161      6     17      5     11      3      7      3    213     17     32     24
NYC                 13      0      1      0      1      0      0      0     15      0      1      1
Utah                22      1      2      0      3      0      1      0     29      1      3      4
California I        48      5      4      3      6      1      1      3     71     12     11     11
California II       22      1      0      0      1      2      0      2     28      5      2      5
White Females
Charlotte           73     21     19     17     14      7     16     14    181     59     66     51
Birmingham         171     29     45     26     41     18     24     23    377     96    118    106
NYC                 38      5      7      1      2      2      3      4     62     12     15     11
Utah               145     35     26     14     50     27     25     28    350    104     93    130
California I       219     34     36     17     38     15     13     21    393     87     87     87
California II      113     29     25     14     25     11     21     22    260     76     82     79
Other Females
Charlotte           25      4      7     12      7      4      3      5     67     25     27     19
Birmingham         108     13     29     15     19      7      9     12    212     47     65     47
NYC                 10      1      1      0      2      1      0      1     16      3      2      4
Utah                21      3      3      0      4      4      0      1     36      8      4      9
California I        39      7      5      2      9      0      0      2     64     11      9     11
California II       18      4      4      1      4      0      0      1     32      6      6      5
as well. There is again a problem with the entries for NYC,
as the frequency for 'Other' males for 1973 is 0. A
slightly different adjustment was applied in this analysis
than that of section 5.1, where .5 was added to one randomly
selected cell to produce a non-zero function value for the
NYC males subpopulation. Instead, each of the entries for
that particular subpopulation was increased by .5, and then
further adjusted by a multiplicative factor

$$a = \frac{\sum_{j=1}^{8} y_{123j}}{\sum_{j=1}^{8} (y_{123j} + .5)}$$

where $\{y_{ghij}\}$ represents the set of table entries; g=1,2 for
sex, h=1,2 for race, i=1,2,...,6 for area, and j=1,2,...,8 for
response profile. Thus, all of the row's entries are
augmented by a small amount so that zero frequencies are
eliminated, but the marginal total is constrained to be the
original total. In this case, the adjustment factor is
a=15/19, and the numbers in parentheses in Table 5.2.1 are
the adjusted counts.
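The adjustment just described can be sketched for the NYC 'Other' males row of Table 5.2.1. The variable names are hypothetical; the steps are exactly those stated above: add .5 to every cell, then rescale so the row total is unchanged.

```python
# NYC 'Other' males, profile counts from Table 5.2.1
counts = [13, 0, 1, 0, 1, 0, 0, 0]
n = sum(counts)                         # original row total

shifted = [y + 0.5 for y in counts]     # add .5 to each of the eight cells
a = n / sum(shifted)                    # multiplicative factor restoring the total
adjusted = [a * y for y in shifted]

# 1973 marginal: profiles 1973, (73,74), (73,75), (73-75) are positions 1, 3, 5, 7
marginal_1973 = sum(adjusted[j] for j in (1, 3, 5, 7))

print(f"a = {a:.4f}")
print([round(y, 3) for y in adjusted])
print(f"adjusted 1973 marginal = {marginal_1973:.3f}")
```

Every cell is now positive, the row still sums to 15, and the 1973 marginal frequency is no longer zero, so the covariance matrix for this subpopulation is no longer singular on that account.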
The data are assumed to have the product multinomial
distribution, so that
$$\Pr\{y\} = \prod_{g=1}^{2} \prod_{h=1}^{2} \prod_{i=1}^{6} n_{ghi\cdot}! \prod_{j=1}^{8} \frac{\pi_{ghij}^{y_{ghij}}}{y_{ghij}!}$$

where $\pi_{ghij}$ represents the probability that a randomly
selected subject of the g-th sex, h-th race, and i-th area
has the j-th profile. The function vector whose elements
are the marginal probabilities of having 2+ colds in 1973,
1974, and 1975 is written

$$F(p) = Ap = \left( \begin{bmatrix} 0&1&0&1&0&1&0&1 \\ 0&0&1&1&0&0&1&1 \\ 0&0&0&0&1&1&1&1 \end{bmatrix} \otimes I_{24} \right) p$$

where $p = (y_{ghij}/n_{ghi\cdot})$ is a (192 x 1) vector of sample
proportions. A consistent estimator for the covariance
matrix of F is $V_F = A V_p A'$, where Vp is defined as in (3.2.5).
The identity model again was used to assess potential
sources of variation:
$$E\{F(p)\} = A\pi = I\beta = \beta$$
and Table 5.2.2 contains the results of tests for those
linear hypotheses concerning the estimates of 2+ colds for
each of the subpopulations for each of the three years.
These results indicate that there are no significant
three-way interactions (α=.05) in the data, and also that
the four-way interaction is non-significant (p=.82). The
two-way interactions which are significant are (sex x area)
(QC=14.49, d.f.=5) and (time x area) (QC=42.07, d.f.=10).
Sex, area, and race were all very significant (all p-values
<.001), while the average time effect was not (p=.4241).
Accordingly, a useful reduced model for F(p) is
$$E\{F(p)\} = A\pi = X_2 \beta_2$$
This model incorporates the above findings by including
Table 5.2.2
Hypotheses and Resulting Test Statistics Concerning
Proportions of 2+ Colds Reported in 1973, 1974, and 1975

Hypothesis                                             QC    d.f.   P-Value
 1. No difference between sexes for averages         44.15     1     .000
    over area x time x race
 2. No difference between races for averages         13.64     1     .000
    over area x time x sex
 3. No variation among areas for averages            23.48     5     .000
    over time x sex x race
 4. No variation among times for averages             1.72     2     .424
    over sex x area x race
 5. Homogeneity across race of differences             .05     1     .824
    between sexes for averages across
    time x area (i.e. race x sex)
 6. Homogeneity across areas of differences          14.49     5     .010
    between sexes for averages across
    race x time
 7. Homogeneity across time of differences             .96     2     .620
    between sexes for averages across
    race x area
 8. Homogeneity across areas of differences           8.86     5     .115
    between races for averages across
    sex x time
 9. Homogeneity across time for differences           1.46     2     .483
    between races for averages across
    sex x area
10. Homogeneity across time for differences          42.06    10     .000
    among areas for averages across
    sex x race
11. No average sex x race x area interaction          6.71     5     .243
12. No average sex x race x time interaction           .21     2     .814
13. No average sex x area x time interaction          6.63    10     .760
14. No average race x area x time interaction         3.41    10     .970
15. No sex x race x area x time interaction           5.96    10     .818
separate intercepts for the six areas, and separate effects
for sex, race, and time within each of the areas. The 30
parameter model X2 is shown in Table 5.2.3, along with
parameter estimates and interpretations. The goodness-of
fit is supported by the non-significance of the Wald
statistic (d.f.=42). The parameter interpretations are as
follows:
β1-β6: reference values for white males in 1973 for
Charlotte, Birmingham, NYC, Utah, California I,
and California II, respectively
β7-β12: increments for females for each area
β13-β18: increments for 'Other' race for each area
β19-β20: increments for 1974 and 1975, for Charlotte
β21-β22: increments for 1974 and 1975, for Birmingham
β23-β24: increments for 1974 and 1975, for NYC
β25-β26: increments for 1974 and 1975, for Utah
β27-β28: increments for 1974 and 1975, for California I
β29-β30: increments for 1974 and 1975, for California II
Hypotheses concerning interactions between sex and
area, race and area, and time and area were evaluated with
Wald statistics of the form $Q_C = b'C'(C V_b C')^{-1} C b$. Table
5.2.4a displays the hypotheses couched in terms of the
respective parameters as well as the appropriate C matrices.
All tests proved significant (α=.05). It is again of
interest to continue modeling efforts in an attempt to
describe more finely the nature of such interactions, using
the results for model X2 as the inferential justification.
Table 5.2.3
Specification Matrix, Estimated Parameters, and Standard Errors for
Preliminary Model X2 for 2+ Colds

Specification matrix X2 (72 x 30): the row for a given sex, race,
area i (i=1,...,6), and year has a 1 in column i; a 1 in column
6+i for females; a 1 in column 12+i for 'Other' race; a 1 in
column 17+2i for 1974 or in column 18+2i for 1975; and zeros
elsewhere.

Parameter      b      s.e.(b)
β1           .209     .0254
β2           .139     .0130
β3           .120     .0331
β4           .173     .0106
β5           .149     .0136
β6           .199     .0204
β7           .148     .0263
β8           .117     .0165
β9           .080     .0418
β10          .106     .0202
β11          .057     .0172
β12          .097     .0245
β13         -.054     .0258
β14         -.052     .0163
β15         -.001     .0520
β16         -.127     .0246
β17         -.029     .0235
β18         -.126     .0346
β19          .044     .0261
β20         -.073     .0263
β21          .075     .0144
β22          .024     .0142
β23         -.017     .0335
β24          .003     .0340
β25          .008     .0191
β26          .084     .0199
β27          .010     .0151
β28          .018     .0152
β29          .006     .0206
β30          .015     .0213
These models are intended to provide a mechanism with which
one can describe the variation in the data. Efforts first
concentrate on smoothing parameterizations for components
other than time. Consideration of the estimated parameters
and their respective standard errors suggests that the following
reductions be investigated:
C4: There is equality of sex effects for Charlotte,
    Birmingham, NYC, Utah, and California II.
C5: There is equality of race effects for Birmingham
    and Charlotte, and for Utah and California II,
    respectively, and null race effects for NYC and
    California I.
Since the model for 1+ asthma discussed in section 5.1
included common parameters for the two California areas,
another model simplification evaluated was whether the area
and sex parameters were equivalent in model X2 for 2+ colds:
C6 : There is equality of reference, sex, and raceparameters for California I and California II.
The contrast matrices which correspond to the above
reductions are presented in Table 5.2.4b, along with the
resulting Wald statistics. These contrasts might be
considered 'compound' in that multiple sets of contrasts are
tested simultaneously. For example, C5 tests whether two
sets of two parameters can each be replaced by one, and at
the same time whether two other effects are null. This is a
convenient strategy when the number of functions being
modeled is large and subsequently the number of parameters
involved in any preliminary investigative model such as X2
is also going to be considerable. Reduction-implying
Table 5.2.4a
Results of Linear Hypotheses Concerning the Parameter
Estimates for the Model X2 for 2+ Colds

        0  0  0  0  0  0    1  0  0  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  1  0  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
C1 =    0  0  0  0  0  0    0  0  1  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  1  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  0  1 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0

H1: β7 = β8 = β9 = β10 = β11 = β12
QC = 11.15   p-value = .0484   d.f. = 5

        0  0  0  0  0  0    0  0  0  0  0  0    1  0  0  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  1  0  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0
C2 =    0  0  0  0  0  0    0  0  0  0  0  0    0  0  1  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  1  0 -1    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  1 -1    0  0  0  0  0  0    0  0  0  0  0  0

H2: β13 = β14 = β15 = β16 = β17 = β18
QC = 14.04   p-value = .0154   d.f. = 5

        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    1  0  0  0  0  0    0  0  0  0 -1  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  1  0  0  0    0  0  0  0 -1  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  1  0    0  0  0  0 -1  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    1  0  0  0 -1  0
C3 =    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  1  0 -1  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  1  0  0  0  0    0  0  0  0  0 -1
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  1  0  0    0  0  0  0  0 -1
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  1    0  0  0  0  0 -1
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  1  0  0  0 -1
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  1  0 -1

H3: β19 = β21 = β23 = β25 = β27 = β29,
    β20 = β22 = β24 = β26 = β28 = β30
QC = 56.32   p-value = .0001   d.f. = 10
Table 5.2.4b
Results of Tests for Simplifications Concerning Parameter
Estimates for the Model X2 for 2+ Colds

        0  0  0  0  0  0    1  0  0  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
C4 =    0  0  0  0  0  0    0  1  0  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  1  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  1  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0

H4: β7 = β8 = β9 = β10 = β12
QC = 3.07   p-value = .5465   d.f. = 4

        0  0  0  0  0  0    0  0  0  0  0  0    1 -1  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
C5 =    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  1  0 -1    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  1  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  1  0    0  0  0  0  0  0    0  0  0  0  0  0

H5: β13 = β14, β16 = β18, β15 = 0, β17 = 0
QC = 1.54   p-value = .8187   d.f. = 4

        0  0  0  0  1 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  0  1 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0
C6 =    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  1 -1    0  0  0  0  0  0    0  0  0  0  0  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  1  0 -1  0
        0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  1  0 -1

H6: β5 = β6, β11 = β12, β17 = β18, β27 = β29, β28 = β30
QC = 18.58   p-value = .0023   d.f. = 5
simplifications can thus be grouped together according to
the effects their corresponding parameters are
characterizing, such as sex, race, and time. If the
simplification is contradicted, then the individual
components can be tested separately. This approach is also
sensible when one considers that the reduced model the
simultaneous reductions imply has a goodness-of-fit
chi-square equivalent to that of the original model plus the
quantity equivalent to the Wald statistic for the test of
the simultaneous reduction. The test for equality of the
California parameters is clearly significant (α=.05),
indicating that the California areas cannot be treated the
same as they were in section 5.1. Reductions C4 and C5 are
not contradicted and suggest that certain reductions in the
model structure are plausible.
The next step in the model-fitting for 2+ colds was to
incorporate the results of the tests for C4 and C5 into a
reduced model X3 for the 2+ colds estimates. This model is
expressed
$$E\{F(p)\} = F(\pi) = X_3 \beta_3$$
where X3 is of rank 22 and includes six reference parameters
(one for each area), twelve time parameters as in X2 , one
sex effect for California I, and one other sex effect for
the other five areas. Also in the model is one race effect
for both Birmingham and Charlotte, and one other for Utah
and California II. Columns 7,8,9,10 and 12 of X2 have been
added together, as have columns 13 and 14, and columns 16
145
and 18. Columns 15 and 17 have been deleted. Since NYC and
California do not have race effects in this model, their
respective reference values apply to both white and 'Other'
races. The goodness-of-fit for the model X3 is QW=35.22
(d.f.=50, p=.94). Note that the difference between the
goodness-of-fit statistics for X3 and X2 is 35.22 - 30.52 = 4.70,
and the difference in degrees of freedom is 30 - 22 = 8. The
value 4.70 is the value of the Wald statistic for the 8
degree of freedom test for the simultaneous testing of
reductions C4 and C5. The parameter estimates and their
standard errors are displayed in Table 5.2.5.
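
The additivity described above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper name `chi2_sf_even` is introduced here for illustration, and it relies on the closed-form upper tail of the chi-square distribution, which holds only for even degrees of freedom. The statistics are those quoted in the text, and the count of 72 response functions is taken from the residual discussion in this chapter.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variate with even d.f."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs positive even d.f.")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (x/2)^k / k!
        total += term
    return math.exp(-half) * total

# Values quoted in the text for models X2 (rank 30) and X3 (rank 22).
N_FUNCTIONS = 72
qw_x2, rank_x2 = 30.52, 30
qw_x3, rank_x3 = 35.22, 22

wald_reduction = qw_x3 - qw_x2      # Wald statistic for the simultaneous reduction
df_reduction = rank_x2 - rank_x3    # number of parameters removed
df_x3 = N_FUNCTIONS - rank_x3       # residual d.f. of the reduced model

p_reduction = chi2_sf_even(wald_reduction, df_reduction)
p_fit_x3 = chi2_sf_even(qw_x3, df_x3)
print(round(wald_reduction, 2), df_reduction, round(p_reduction, 2))
print(df_x3, round(p_fit_x3, 2))
```

The same helper applies to any of the even-d.f. tests reported in this chapter; odd d.f. would need a different routine.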
Table 5.2.5
Estimated Parameters and Standard Errors
For Intermediate Model X3 For 2+ Colds

Reference parameters (β1-β6)
   .224     .142     .107     .170     .146     .191
  (.214)   (.0133)  (.0284)  (.0149)  (.0139)  (.0177)

Sex parameters (β6,β7)
   .113     .054
  (.0101)  (.0171)

Race parameters (β8,β9)
  -.0552   -.1269
  (.0137)  (.0199)

Time parameters (β10-β22)
   .045    -.076     .075     .024    -.015     .003
  (.0261)  (.0260)  (.0144)  (.0141)  (.3350)  (.0340)
   .007     .083     .010     .017     .007     .015
  (.0180)  (.0199)  (.0151)  (.0151)  (.0205)  (.0212)

At this point, smoothing has been accomplished for
those effects which represent among subject variation.
Time, on the other hand, is an effect which represents
within subject variation. Such variation is usually smaller
in magnitude than the among subject variation.
Consideration of the parameters reflecting the variability
of time for the 2+ colds estimates leads to the following
simplification:

    All time effects are null except for the increment
    for Charlotte for 1975, the increment for
    Birmingham in 1974, and the increment for
    Utah in 1975.

One can proceed as before, and investigate this
hypothesis with a test of the form H0: Cβ=0, or one can test
this hypothesis via a less direct, although equivalent,
mechanism. As discussed in Koch, Imrey, et al. (1985),
testing linear hypotheses concerning the elements of β is
equivalent to specifying a linear model for β, i.e.

$$\beta_3 = Z\gamma$$

where Z is a known (t x (t-c)) orthocomplement to C, β3 is the
parameter vector for the model X3, and γ is a ((t-c) x 1)
vector of parameters. The goodness-of-fit test for this
model is equal to the Wald statistic Qc for the
corresponding H0, and it can
be shown that the goodness-of-fit statistic Qw,c for X4=X3Z
is equal to Qc plus Qw, the Wald goodness-of-fit statistic
for the original model X3. Fitting a model Z to β is often
useful when formulating H0 in terms of a contrast matrix is
cumbersome, but representing it as a linear model in β is
straightforward. Here,
QC = 8.23 (d.f.=9) with p=.52. Thus, the goodness-of-fit
for the reduced model X4 = X3Z is QWC = 35.2261 + 8.2762 =
43.50, with 50 + 9 = 59 degrees of freedom. The estimated
parameter vector and its standard error vector are as
follows:

Parameter    Estimate    s.e.
γ1            .2444      .0177
γ2            .1528      .0116
γ3            .1016      .0201
γ4            .1730      .0122
γ5            .1535      .0108
γ6            .1977      .0133
γ7            .1144      .0100
γ8            .0557      .0171
γ9           -.0551      .0137
γ10          -.1270      .0199
γ11          -.1004      .0219
γ12           .0649      .0131
γ13           .0802      .0178

These estimates have the following interpretation:

γ1:  Reference value for Charlotte white males in 1973
γ2:  Reference value for Birmingham white males in 1973
γ3:  Reference value for NYC white males in 1973
γ4:  Reference value for Utah white males in 1973
γ5:  Reference value for California I white males in 1973
γ6:  Reference value for California II white males in 1973
γ7:  sex effect for all areas except California I
γ8:  sex effect for California I
γ9:  race effect for Birmingham and Charlotte
γ10: race effect for Utah and California II
γ11: time effect for 1975 for Charlotte
γ12: time effect for 1974 for Birmingham
γ13: time effect for 1975 for Utah

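
The equivalence between the contrast form of the hypothesis and the linear model for the X3 parameters can be sketched as follows. This is an illustration under a stated assumption, not the matrix used in the dissertation: the ordering of the twelve time parameters within X3 (1974 then 1975, area by area) is assumed, and the names `KEPT_TIME`, `Z`, and `C` are introduced here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

N_X3 = 22  # parameters in model X3

# ASSUMPTION: parameters 11-22 of X3 are the time increments, ordered
# (1974, 1975) within Charlotte, Birmingham, NYC, Utah, California I, II.
KEPT_TIME = [12, 13, 18]  # Charlotte 1975, Birmingham 1974, Utah 1975 (1-based)
kept = list(range(1, 11)) + KEPT_TIME
dropped = [j for j in range(1, N_X3 + 1) if j not in kept]

# Z places the 13 retained parameters (gamma) into the 22 slots of beta3.
Z = [[1 if kept[c] == r else 0 for c in range(len(kept))]
     for r in range(1, N_X3 + 1)]
# C selects the nine time increments that the hypothesis sets to zero.
C = [[1 if j == d else 0 for j in range(1, N_X3 + 1)] for d in dropped]

CZ = matmul(C, Z)  # all zeros: every beta3 = Z gamma satisfies C beta3 = 0

# Goodness-of-fit bookkeeping with the statistics quoted in the text.
qw_x3, df_x3 = 35.2261, 50
qc, df_c = 8.2762, len(dropped)
qw_x4, df_x4 = qw_x3 + qc, df_x3 + df_c
print(len(Z), len(Z[0]), df_c, round(qw_x4, 2), df_x4)
```

Because Z here only selects columns, X4 = X3 Z amounts to deleting the nine columns of X3 whose parameters are hypothesized to be null.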
The final step in this model-reduction process, after
measures have been taken to smooth the among subject sources
of variation sex and race, and the within subject source of
variation time, was to examine whether any of the area
reference parameters could be combined. Now, it may have
been decided a priori not to include this phase in the
modeling process, i.e. it may make sense to include a
reference parameter for each individual area and to smooth
only the estimated parameters concerning sex, race, and
time. In that case, model X4 would stand as the 'final'
model. However, given that one has decided to proceed, the
following additional simplifications are of interest:

    C1: There is no difference among the reference
        parameters for Birmingham, Utah, California I,
        or California II

    C2: There is no difference between the reference
        parameters for Birmingham, Utah, or California I

C1 is contradicted (Qc=9.24, d.f.=3, p=.03), while C2 is not
(QC=2.04, d.f.=2, p=.36). These results imply that model X5,
which is of rank 11, is suitable. The specification matrix
for X5 is shown in Table 5.2.6, along with parameter
estimates and their standard errors. The goodness-of-fit is
QW=45.49 (d.f.=61), with a p-value of .93.
Another tool with which one can assess the
appropriateness of individual model parameters is residual
analysis. While this has been a prime focus of regression
diagnostics in the application of linear models to
continuous data for many years, it has remained relatively
unused in the analysis of categorical data. Vicki Davis
reviews residual analysis for categorical data in the
context of weighted least squares procedures (Davis, 1984).
One can choose to explore residuals at either the function
level or the probability level; the former may allow one to

Table 5.2.6

Specification Matrix, Estimated Parameters, and Standard Errors for
Preliminary Model X5 for 2+ Colds

[The 72 x 11 specification matrix is not legible in this copy.]

Parameter     B      s.e.(B)   Interpretation
B1           .245     .017     Reference value for Charlotte
B2           .159     .007     Reference value for Birmingham, Utah,
                               and California I
B3           .101     .020     Reference value for NYC
B4           .196     .013     Reference value for California II
B5           .116     .010     Increment for females in all areas
                               except California I
B6           .050     .015     Increment for females in California I
B7          -.060     .012     Increment for 'Other' race in Birmingham
                               and Charlotte
B8          -.118     .019     Increment for 'Other' race in Utah and
                               California II
B9          -.100     .022     Increment for 1975 in Charlotte
B10          .062     .013     Increment for 1974 in Birmingham
B11          .089     .017     Increment for 1975 in Utah

assess the compatibility of the data with the chosen model,
while the latter may expose irregularities of the model in
terms of the underlying contingency table. At the function
level, the residuals are defined as $F - \hat{F}$, where $\hat{F} = Xb$. A
consistent estimator for the covariance matrix of $(F - \hat{F})$ can
be written

$$V_{F-\hat{F}} = V_F - V_{\hat{F}} = V_F - X V_b X'$$

The general linear model specification E{F(p)}=Xβ can be
written as

$$F(p) = X\beta + \epsilon$$

where ε represents a vector of unknown true error terms,
assumed to be multivariate normal with mean vector 0 and a
known covariance matrix V. (Unlike in ordinary least
squares regression, the elements of ε are not assumed to be
independent.) Given that a particular model is adequate for
F(p), the residuals $(F - \hat{F})$ should also be approximately
multivariate normal. Accordingly, one can compute
standardized residuals as the individual elements of $(F - \hat{F})$
divided by their respective standard errors, and assume
normality with 0 mean and unit variance. By calculating
the p-values for these z-scores, one can spot extreme values
which may require additional attention. Also, if large
numbers of the p-values are significant, either via an
α=.10 criterion or a more conservative Bonferroni criterion,
then one may conclude that the model is inappropriate for
many functions and needs to be amended to incorporate
additional sources of variation. Standardized residuals
were calculated for the 2+ colds models X2 (30 parameter),
X3 (22 parameter), and X5 (11 parameter) as an additional
method to assess their effectiveness in characterizing the
data. Specifically, it allows one to see whether the
predicted values of any particular functions are adversely
affected by the model reductions which took place. Table
5.2.7 contains the residuals, their standard errors,
standardized residuals, and the corresponding p-values for
models X2 ,X3 , and X5 . X2 is the preliminary model which
included race, sex, and time effects within area. There
were 30 parameters altogether. There are four residuals
which stand out in the sense that their p-values are < .10.
These correspond to the predicted marginal probability of 2+
colds for white males in Utah in 1974 and California II in
1975, and 'Other' males in 1973. Just one of these
residuals is significant at the α=.05 level of significance.
This is reassuring, since one would expect between 3 and 4
of the 72 tests to be significant by chance alone. If one
applies the Bonferroni correction to the tests, the error
rate of α=.05/72=.00069 is not met by any of the p-values.
Table 5.2.7 also contains the residuals and standardized
residuals for the 22 parameter model, which includes
smoothing for some race and sex effects. With this model,
two additional predicted marginal probabilities become
'prominent' in terms of having corresponding p-values less
than .1. These are the estimates for 2+ colds for 'Other'
females in Charlotte in 1973 and in California I in 1974.
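
The residual screening described above can be sketched in a few lines. This is an illustration only: the three function estimates and their variances are hypothetical, the functions are taken to be independent so that the covariance matrix is diagonal, and the fitted model is a single common mean. Only the count of 72 tests and the α levels come from the text.

```python
import math

def two_sided_p(z):
    """Two-sided normal p-value for a standardized residual."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# HYPOTHETICAL inputs: three independent function estimates and variances,
# fitted by a one-parameter (common mean) model X = (1, 1, 1)'.
F = [0.213, 0.279, 0.162]
var_F = [0.0008, 0.0009, 0.0007]

weights = [1.0 / v for v in var_F]
var_b = 1.0 / sum(weights)                          # V_b = (X' V_F^-1 X)^-1
b = var_b * sum(w * f for w, f in zip(weights, F))  # weighted least squares estimate

rows = []
for f, v in zip(F, var_F):
    resid = f - b                  # F - F_hat, with F_hat = Xb
    se = math.sqrt(v - var_b)      # diagonal element of V_F - X V_b X'
    z = resid / se
    rows.append((resid, se, z, two_sided_p(z)))

N_TESTS = 72                           # residuals examined per model in the text
ALPHA = 0.05
expected_by_chance = N_TESTS * ALPHA   # about 3.6 nominally significant results
bonferroni_cutoff = ALPHA / N_TESTS    # about 0.00069
flagged = [r for r in rows if r[3] < 0.10]
for resid, se, z, p in rows:
    print(round(resid, 3), round(se, 4), round(z, 2), round(p, 3))
```

With correlated functions, as in the repeated measures setting of this chapter, the full matrices would be needed in place of the diagonal shortcut used here.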

Table 5.2.7

Residuals and P-values for Models X2, X3 and X5
for Charlotte and California I

Charlotte               Model X2       Model X3       Model X5
Sex     Race   Year    Res.    p      Res.    p      Res.    p

Male    White  1973    .005  .852    -.010  .712    -.032  .291
Male    White  1974    .026  .341     .010  .738     .034  .323
Male    White  1975    .026  .213     .014  .575     .016  .545
Male    Other  1973    .002  .856    -.001  .978    -.017  .370
Male    Other  1974    .017  .345     .014  .476     .010  .620
Male    Other  1975    .002  .914    -.001  .931     .006  .757
Female  White  1973    .007  .781     .020  .524     .026  .482
Female  White  1974   -.024  .116    -.013  .558    -.022  .429
Female  White  1975    .020  .476     .033  .324     .042  .287
Female  Other  1973   -.017  .158    -.014  .308    -.003  .894
Female  Other  1974    .023  .087     .026  .098     .044  .038
Female  Other  1975   -.007  .663    -.004  .832     .002  .990

California I            Model X2       Model X3       Model X5
Sex     Race   Year    Res.    p      Res.    p      Res.    p

Male    White  1973    .016  .207     .022  .111     .012  .454
Male    White  1974    .006  .659     .012  .399     .012  .454
Male    White  1975   -.002  .868     .004  .759     .012  .454
Male    Other  1973   -.003  .851    -.012  .573    -.019  .439
Male    Other  1974    .013  .434     .004  .861     .004  .878
Male    Other  1975   -.007  .704    -.015  .480    -.008  .762
Female  White  1973    .070  .156     .091  .092     .072  .199
Female  White  1974    .055  .275     .076  .167     .102  .074
Female  White  1975    .053  .254     .078  .130     .082  .114
Female  Other  1973    .017  .439     .021  .384     .007  .780
Female  Other  1974    .027  .286     .031  .253     .030  .277
Female  Other  1975   -.007  .758    -.003  .905     .007  .780

The p-value for the residual corresponding to white males
in California I in 1973 is now .15. Only one of the 72
tests is significant at the α=.05 level of significance,
which lends credence to the overall acceptability of the
model X3.
Finally, note the residuals and associated statistics
for the final 11 parameter model. Despite a marked decrease
in the number of characterizing parameters, the residuals
for X5 include only five with p-values less than .1 and only
one with a p-value significant at the α=.05 level of
significance. The residual for White males in Utah in 1974
is significant at α=.05 (p=.038), while the residual for
White males in California I in 1973 again has a p-value
below .10. The other residuals with p-values below .10
include 'Other' males in Charlotte in 1975, 'Other' females
in Charlotte in 1974 and in California I in 1974. Since
there are not any extreme 'misses' so far as predicted
functions go, nor any obvious pattern for the residuals with
the smaller p-values with respect to race, sex, time, or
area, one can conclude that the residual analysis for the 11
parameter model supports the adequacy of the model as
implied by its goodness-of-fit statistic. The original
marginal probabilities for the model X5 as well as the
predicted marginal probabilities are displayed in Table
5.2.8.

Table 5.2.8

Observed and Predicted Marginal Probabilities of 2+ Colds
in 1973, 1974 and 1975 with Final Model X5

                        White        Other        White        Other
                        Males        Males       Females      Females
Area           Year   Obs   Pre    Obs   Pre    Obs   Pre    Obs   Pre

Charlotte      1973   .213  .246   .175  .186   .326  .361   .373  .301
               1974   .279  .246   .206  .186   .365  .361   .403  .301
               1975   .162  .146   .048  .086   .282  .262   .284  .202

Birmingham     1973   .142  .159   .080  .099   .255  .274   .222  .215
               1974   .231  .221   .150  .161   .313  .337   .307  .277
               1975   .165  .159   .113  .099   .281  .274   .222  .215

NYC            1973   .127  .101   .105  .101   .194  .217   .188  .217
               1974   .794  .101   .158  .101   .242  .217   .125  .217
               1975   .143  .101   .158  .101   .177  .217   .250  .217

Utah           1973   .156  .159   .034  .040   .297  .274   .222  .156
               1974   .203  .159   .103  .040   .266  .274   .111  .156
               1975   .250  .248   .138  .129   .371  .363   .250  .245

California I   1973   .133  .159   .169  .159   .221  .209   .172  .209
               1974   .164  .159   .155  .159   .221  .209   .141  .209
               1975   .171  .159   .155  .159   .221  .209   .172  .209

California II  1973   .188  .196   .179  .077   .292  .311   .188  .193
               1974   .212  .196   .071  .077   .315  .311   .188  .193
               1975   .219  .196   .179  .077   .304  .311   .156  .193

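
The predicted probabilities in Table 5.2.8 follow from adding the relevant X5 parameters for each subpopulation and year. The sketch below illustrates this with the rounded estimates of Table 5.2.6. It assumes that B8 and B9 are negative, a sign inferred here from the predicted values since the scanned table does not show it clearly. The function name `predict` and the lookup tables are introduced for illustration.

```python
# Rounded estimates from Table 5.2.6.
# ASSUMPTION: B8 and B9 carry negative signs (inferred from Table 5.2.8).
B = {1: .245, 2: .159, 3: .101, 4: .196, 5: .116, 6: .050,
     7: -.060, 8: -.118, 9: -.100, 10: .062, 11: .089}

REFERENCE = {"Charlotte": 1, "Birmingham": 2, "Utah": 2, "California I": 2,
             "NYC": 3, "California II": 4}
RACE = {"Charlotte": 7, "Birmingham": 7, "Utah": 8, "California II": 8}
TIME = {("Charlotte", 1975): 9, ("Birmingham", 1974): 10, ("Utah", 1975): 11}

def predict(area, sex, race, year):
    """Predicted marginal probability of 2+ colds under model X5."""
    value = B[REFERENCE[area]]
    if sex == "Female":
        value += B[6] if area == "California I" else B[5]
    if race == "Other" and area in RACE:
        value += B[RACE[area]]
    if (area, year) in TIME:
        value += B[TIME[(area, year)]]
    return value

for args in [("Charlotte", "Female", "White", 1973),
             ("Birmingham", "Female", "Other", 1974),
             ("Utah", "Male", "Other", 1975)]:
    print(args, round(predict(*args), 3))
```

Small differences in the third decimal place relative to Table 5.2.8 are to be expected, because the inputs here are rounded estimates.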
5.3 Linear Models Analysis for 1+ Asthma for the Area x Sex
Cross-classification for 1973, 1974, and 1975 Combined
One approach to the analysis of the outcome variables
in this dataset that does not take into account the repeated
measurement structure of the data is to combine the data for
the years 1973, 1974, and 1975. One could choose to model
the variation among a demographic cross-classification of
those subjects who reported 1+ asthma for at least one of
the years 1973, 1974, or 1975 versus those who reported no
asthma at all for each of the three years. In order to
compare the results of such an analysis with those reported
in section 5.1, a linear models analysis of those subjects
who reported at least one occurrence of asthma symptoms in
the study years 1973-1975 versus those who reported none for
the cross-classification of (area x sex) was completed.
Table 5.3.1 contains the relevant frequencies.

Table 5.3.1
1 or More Asthma Events Reported in 1973, 1974, or 1975

Area            Sex       None    At Least One

Charlotte       Male       185        14
Birmingham      Male       461        55
NYC             Male        76         2
Utah            Male       319        30
California I    Male       453        69
California II   Male       249        39
Charlotte       Female     229        19
Birmingham      Female     543        46
NYC             Female      73         5
Utah            Female     353        33
California I    Female     418        39
California II   Female     275        17

Thus, the objective of this analysis is to model the
variation among the proportions of subjects who reported 1+
asthma in 1973-1975 for the (sex x area) subpopulations
displayed above. Let $y_{hij}$ represent the frequency of
subjects in the h-th area (h=1,2,...,6) for the i-th sex
(i=1 for Male, i=2 for Female), for the j-th response (j=1
for No asthma, j=2 for 1+ asthma). Then, the function of
interest can be written

$$F(p) = Ap = \left( [\,0 \;\; 1\,] \otimes I_{12} \right) p$$

where $p = (y_{hij} / y_{hi\cdot})$ is the (24 x 1) vector of sample
proportions. Given that the $\{y_{hij}\}$ can be considered to
have the product multinomial distribution, one can proceed
to pursue a weighted least squares analysis of F(p) and its
covariance matrix $V_F = A V_p A'$.
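
As a concrete illustration of these functions, the sketch below computes the proportion reporting 1+ asthma and its estimated standard error for each (area x sex) subpopulation directly from the frequencies in Table 5.3.1. Because the subpopulations are independent and each contributes a single function, the covariance matrix is diagonal with binomial variances, so no matrix algebra is needed. The function and variable names are introduced here for illustration.

```python
import math

# (none, at least one) counts from Table 5.3.1.
COUNTS = {
    ("Charlotte", "Male"): (185, 14),
    ("Birmingham", "Male"): (461, 55),
    ("NYC", "Male"): (76, 2),
    ("Utah", "Male"): (319, 30),
    ("California I", "Male"): (453, 69),
    ("California II", "Male"): (249, 39),
    ("Charlotte", "Female"): (229, 19),
    ("Birmingham", "Female"): (543, 46),
    ("NYC", "Female"): (73, 5),
    ("Utah", "Female"): (353, 33),
    ("California I", "Female"): (418, 39),
    ("California II", "Female"): (275, 17),
}

def asthma_function(none, at_least_one):
    """Return (F, var F): the 1+ asthma proportion and its binomial variance."""
    n = none + at_least_one
    p = at_least_one / n           # [0 1] applied to (p_none, p_one)'
    return p, p * (1.0 - p) / n    # A V_p A' for one subpopulation

F = {key: asthma_function(*cells)[0] for key, cells in COUNTS.items()}
SE = {key: math.sqrt(asthma_function(*cells)[1]) for key, cells in COUNTS.items()}

for key in COUNTS:
    print(key, round(F[key], 3), round(SE[key], 4))
```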
In previous sections, the cell mean (identity) model
was first used in conjunction with a series of hypothesis
tests to produce a preliminary assessment of variation. The
subsequent results usually indicated the direction to take
in forming descriptive models. Since the point of this
section is to compare the results with those of section 5.1,
it was decided to begin with a saturated model which
includes six reference parameters for the six areas and six
additional parameters for the incremental effect of sex
within each of the areas. This model can be written

$$E\{F(p)\} = A\pi = X_1 \beta$$

where

$$X_1 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

β1, β2, β3, β4, β5, and β6 represent reference parameters for
males in areas Charlotte through California II, and
parameters β7, β8, β9, β10, β11, and β12 denote incremental
effects for sex within areas Charlotte through California
II. This is a saturated model, and as such there is no
goodness-of-fit statistic defined. Table 5.3.2 contains the
estimated parameters and their standard errors.

Table 5.3.2
Estimates and Standard Errors for Model X1 for
1+ Asthma in the Years 1973-1975 Combined

Parameter    Estimate    Standard Error

β1             .070         .0181
β2             .107         .0136
β3             .026         .0179
β4             .086         .0150
β5             .132         .0148
β6             .135         .0202
β7             .006         .0248
β8            -.029         .0175
β9             .038         .0330
β10           -.001         .0207
β11            .042         .0199
β12           -.077         .0244

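
Since the model is saturated, its estimates can be obtained without any matrix inversion: the reference parameters are the male proportions, and the sex increments are the female minus male differences within each area. The sketch below illustrates this from the Table 5.3.1 frequencies. The variable names are introduced here, and the standard error of each increment uses the independence of the male and female subpopulations.

```python
import math

AREAS = ["Charlotte", "Birmingham", "NYC", "Utah", "California I", "California II"]
# (none, at least one) counts from Table 5.3.1, in area order.
MALE = [(185, 14), (461, 55), (76, 2), (319, 30), (453, 69), (249, 39)]
FEMALE = [(229, 19), (543, 46), (73, 5), (353, 33), (418, 39), (275, 17)]

def prop_var(none, one):
    """Proportion with 1+ asthma and its binomial variance."""
    n = none + one
    p = one / n
    return p, p * (1.0 - p) / n

beta, se = [], []
for cells in MALE:                 # beta1..beta6: male reference values
    p, v = prop_var(*cells)
    beta.append(p)
    se.append(math.sqrt(v))
for m, f in zip(MALE, FEMALE):     # beta7..beta12: female minus male
    pm, vm = prop_var(*m)
    pf, vf = prop_var(*f)
    beta.append(pf - pm)
    se.append(math.sqrt(vm + vf))  # independent subpopulations

for j, (est, err) in enumerate(zip(beta, se), start=1):
    print("beta%d" % j, round(est, 3), round(err, 4))
```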
The hypothesis test for no (sex x area) interaction
results in a QC=12.01 with 5 d.f., which is non-supportive
of the hypothesis(p=.03). Further testing was done in order
to investigate potential smoothing across the sex effect
parameters. Specifically, the following simplifications
were evaluated:
C1: β7 = β8 = β9 = β10 = 0,
or the hypothesis of null sex effects for Charlotte,
Birmingham, NYC and Utah. The Wald statistic for this test
is non-significant, QC=4.07 with 4 d.f. (p=.40). The
implied reduced model X2 was fit, and further reduction
hypotheses concerning the reference parameters led to a
final model X3 , which is displayed in Table 5.3.3. This
final model includes reference parameters for NYC,
California I, California II, and a common reference
parameter for Charlotte, Birmingham, and Utah combined. In
addition, sex effects included are those for NYC, California
I and California II. The estimates, standard errors, and
observed and predicted functions for the model

$$E\{F(p)\} = X_3 \beta$$

are also displayed in Table 5.3.3.
The Wald statistic for the goodness-of-fit for model X3 is
QC=5.19 with 6 d.f. The p-value for this chi-square is
.5214. Thus, the model renders an adequate description of
the variation among the estimates of 1+ asthma in the (sex x
area) subpopulations. A comparison with the final model for
1+ asthma for 1973, 1974, and 1975 discussed in section 5.1
shows similarities. Both models have similar reference
parameters for California I and California II; however, this
section's analysis did not include smoothing them

Table 5.3.3

Estimates, Standard Errors, Observed and Predicted
Functions for the Model X3 for 1+ Asthma in 1973-1975

Specification Matrix      Observed       Predicted
                          Proportions    Proportions

1 0 0 0 0 0 0               .070           .085
1 0 0 0 0 0 0               .107           .085
0 1 0 0 0 0 0               .026           .037
1 0 0 0 0 0 0               .086           .085
0 0 1 0 0 0 0               .132           .133
0 0 0 1 0 0 0               .135           .135
1 0 0 0 0 0 0               .077           .085
1 0 0 0 0 0 0               .078           .085
0 1 0 0 1 0 0               .064           .037
1 0 0 0 0 0 0               .085           .085
0 0 1 0 0 1 0               .085           .090
0 0 0 1 0 0 1               .058           .058

Estimates and Standard Errors

β1: reference value for Charlotte,           .085 ± .005
    Birmingham and Utah
β2: reference parameter for NYC              .037 ± .015
β3: reference parameter for California I     .132 ± .015
β4: reference parameter for California II    .135 ± .020
β4: sex increment for California I          -.042 ± .020
β5: sex increment for California II         -.077 ± .024

into one since the California areas had different sex
effects. Both analyses resulted in a single reference
parameter for NYC. However, in the latter analysis,
reference parameters for the remaining areas could be
smoothed into one, while the earlier modeling effort
maintained separate reference terms for Charlotte,
Birmingham, and Utah. Both models include negative sex
effects for California; however, the model for the years
combined has a term for the individual California areas
while the earlier model has one parameter for both
California areas. Finally, there can be no variation due to
time for the analysis in this section since the responses
for each year are summed together. There is a significant
effect for 1975 for all males in the earlier analysis.
It should be realized that sections 5.1 and 5.3 are
concerned with different response outcomes. The outcome of
5.1 is the proportion of 1+ asthma for 1973, 1974, and 1975
separately. Since time is a dimension across which the
response measures are determined, the regression set-up
employed also allows it to be modeled as if it were an
independent effect. The outcome measure of section 5.3 is
the proportion of subjects who had 1+ asthma over the years
1973-1975 combined. This quantity has an entirely different
meaning. However, the comparison of analysis efforts is
warranted since section 5.3 illustrates a direction modeling
efforts might take if time were not considered and provides
a benchmark against a modeling effort which does include
time. Other response outcomes which may have been
considered in alternative analyses are the proportion of 1+
asthma in strictly 1973, 1974, or 1975. However, lessening
sample sizes may have made it difficult to have the
expected frequencies of 1+ asthma greater than the desired 5
in each of the subpopulations. In summary, there was a
significant, although limited effect for time in the
analysis which included time in its modeling framework.
This seems to indicate its advantage over an analysis fairly
similar in nature which does not allow for a time effect.
5.4 Linear Model Analysis of Mean Colds for
1973,1974, and 1975
Another response measure of interest is the mean
number of colds for the years 1973, 1974, and 1975. The
functions can be calculated via an A matrix applied to the
proportion vector resulting from a contingency table.
However, the $r = 4^3 = 64$ response profiles (k=0,1,2,3 for
each of the three years) can lead to
cumbersome matrix manipulations, and it is much more
efficient to turn to computations on the individual case
records in order to produce the appropriate function vector
of mean colds for 1973, 1974, and 1975. Specifically, let
$y_{ikl}$ represent the outcome for the k-th categorical variable
for the l-th subject in the i-th subpopulation, where
$l = 1, 2, \ldots, n_{i\cdot}$, and $n_{i\cdot}$ is the number of subjects in the i-th
subpopulation. Then let $y_{il} = (y_{i1l}, y_{i2l}, \ldots, y_{idl})'$ denote
the vector of responses for the l-th individual in the i-th
subpopulation. The sample mean for the k-th response
variable in the i-th subpopulation can thus be written

$$\bar{y}_{ik} = \frac{1}{n_{i\cdot}} \sum_{l=1}^{n_{i\cdot}} y_{ikl}$$

It follows from Central Limit Theory that the vector
$\bar{y}_i = (\bar{y}_{i1}, \bar{y}_{i2}, \bar{y}_{i3})'$ will be approximately multivariate
normal for large enough $n_{i\cdot}$ ($\geq 20$), with a covariance matrix
estimated by

$$V_{\bar{y}_i} = \frac{1}{n_{i\cdot}} \Sigma_i$$

$\Sigma_i$ is the covariance matrix for the case records $\{y_{il}\}$, and
it is consistently estimated as

$$\hat{\Sigma}_i = \frac{1}{n_{i\cdot}} \sum_{l=1}^{n_{i\cdot}} (y_{il} - \bar{y}_i)(y_{il} - \bar{y}_i)'$$

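
The case-record computation can be sketched as follows for a single subpopulation. The five records are hypothetical and serve only to show the arithmetic; the function name `mean_and_covariance` is introduced here. The divisor is the number of subjects, as in the estimator written above.

```python
def mean_and_covariance(records):
    """Mean vector, case-record covariance, and covariance of the mean vector."""
    n = len(records)
    d = len(records[0])
    means = [sum(r[k] for r in records) / n for k in range(d)]
    sigma = [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in records) / n
              for b in range(d)]
             for a in range(d)]
    v_mean = [[s / n for s in row] for row in sigma]
    return means, sigma, v_mean

# HYPOTHETICAL case records: colds reported in 1973, 1974, 1975 by five subjects.
records = [(0, 1, 0), (2, 1, 1), (1, 0, 0), (3, 2, 1), (0, 0, 1)]
means, sigma, v_mean = mean_and_covariance(records)
print(means)
print(v_mean[0])
```

In an actual analysis this would be repeated for each of the 24 subpopulations, and the resulting covariance blocks would be placed on the diagonal of the covariance matrix for the 72 functions.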
For this analysis, the subpopulations investigated
were those chosen by the variable selection procedure for 2+
colds in section 4.5. Thus, there will be 24 subpopulations
(i=1,...,24) and 3 response functions per subpopulation: the
mean number of colds in 1973, 1974, and 1975. The estimated
functions are displayed in Table 5.4.1. Since there were no
a priori restrictions concerning the structure of the model

Table 5.4.1

Mean Colds in 1973, 1974, and 1975
By Area, Sex, and Race

Subpopulation                    1973     1974     1975

Charlotte      White  Male       .890     .831     .691
Charlotte      White  Female    1.099    1.154    1.010
Charlotte      Other  Male       .778     .778     .381
Charlotte      Other  Female    1.149    1.224    1.015
Birmingham     White  Male       .627     .828     .716
Birmingham     White  Female     .915    1.066     .966
Birmingham     Other  Male       .366     .582     .512
Birmingham     Other  Female     .863    1.019     .783
NYC            White  Male       .524     .524     .524
NYC            White  Female     .710     .806     .758
NYC            Other  Male       .400     .467     .600
NYC            Other  Female     .688     .688    1.000
Utah           White  Male       .669     .825     .869
Utah           White  Female     .951     .991    1.191
Utah           Other  Male       .379     .483     .690
Utah           Other  Female     .722     .667     .917
California I   White  Male       .588     .667     .672
California I   White  Female     .819     .832     .822
California I   Other  Male       .648     .606     .620
California I   Other  Female     .594     .625     .672
California II  White  Male       .738     .792     .827
California II  White  Female     .969    1.042    1.012
California II  Other  Male       .571     .250     .571
California II  Other  Female     .625     .688     .781

for mean colds, hypotheses concerning the variation among
the mean cold estimates were again investigated with the
cell mean model:

$$E\{F\} = A\pi = I\beta = \beta$$

The results of this preliminary phase are shown in Table
5.4.2. There are no significant four or three-way
interactions. There is a (time x area) interaction, as the
Wald test statistic is QW = 62.01 (d.f.=10). Also, there is
a significant (sex x area) interaction (Qw = 15.90, d.f.=5),
and a marginally significant (race x area) interaction,
using α=.05 as the significance level criterion.
These initial efforts at assessing the important
sources of variation among the estimates led to the
conclusion that an appropriate structure for additional
modeling would be one in which there would be modules for
each area, with effects for time, sex, and race within each
area. This model can be expressed as follows:

$$E\{F\} = A\pi = X_2 \beta_2,$$

where $X_2 = x_2 \otimes I_6$, and $x_2$ is written:

$$x_2 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 & 1
\end{bmatrix}$$

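
The area-module structure can be made concrete by assembling the full specification matrix from the 12 x 5 module. The sketch below places one copy of the module on the diagonal for each of the six areas, which is how the Kronecker product in the text is read here. That reading and the column labels (reference, sex, race, 1974, 1975) are assumptions made for illustration, as are the names `X_MODULE` and `block_diagonal`.

```python
# Area module x2; ASSUMED column order: reference, sex, race, 1974, 1975.
X_MODULE = [
    [1, 0, 0, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0], [1, 0, 1, 1, 0], [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0], [1, 1, 0, 1, 0], [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0], [1, 1, 1, 1, 0], [1, 1, 1, 0, 1],
]
N_AREAS = 6

def block_diagonal(module, copies):
    """Stack `copies` of `module` along the diagonal of a larger zero matrix."""
    width = len(module[0])
    rows = []
    for block in range(copies):
        for row in module:
            left = [0] * (block * width)
            right = [0] * ((copies - block - 1) * width)
            rows.append(left + list(row) + right)
    return rows

X2 = block_diagonal(X_MODULE, N_AREAS)
n_functions, n_parameters = len(X2), len(X2[0])
residual_df = n_functions - n_parameters
print(n_functions, n_parameters, residual_df)
```

The resulting dimensions agree with the 42 degrees of freedom reported for the goodness-of-fit of the area-modules model.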

Table 5.4.2

Hypotheses and Resulting Test Statistics Concerning
Mean Colds Reported in 1973, 1974, and 1975

     Hypothesis                                        QC    d.f.  P-Value

 1.  No difference between sexes for averages        69.41    1     .000
     over area x time x race
 2.  No difference between races for averages        24.18    1     .000
     over area x time x sex
 3.  No variation among areas for averages           31.23    5     .000
     over time x sex x race
 4.  No variation among times for averages            5.19    2     .075
     over sex x area x race
 5.  Homogeneity across race of differences            .42    1     .517
     between sexes for averages across
     time x area
 6.  Homogeneity across areas of differences         15.90    5     .007
     between sexes for averages across
     race x time
 7.  Homogeneity across time of differences            .28    2     .870
     between sexes for averages across
     race x area
 8.  Homogeneity across areas of differences         10.78    5     .056
     between races for averages across
     sex x time
 9.  Homogeneity across time for differences          1.86    2     .394
     between races for averages across
     sex x area
10.  Homogeneity across time for differences         62.01   10     .000
     among areas for averages across
     sex x race
11.  No average sex x race x area interaction         7.18    5     .208
12.  No average sex x race x time interaction          .23    2     .891
13.  No average sex x area x time interaction        12.15   10     .275
14.  No average race x area x time interaction        9.22   10     .512
15.  No sex x race x area x time interaction          7.53   10     .674

The goodness-of-fit Wald statistic for this model is QW =
43.15 (d.f.=42), indicating an adequate fit. Table 5.4.3
contains the parameter estimates for this model as well as
their standard errors.

Table 5.4.3
Parameter Estimates and Standard Errors for the Area
Modules Model X1 for Mean Colds in 1973, 1974, and 1975

Parameters
Effect   Charlotte  Birmingham   NYC     Utah   Calif. I  Calif. II

Area       .834       .611       .493    .679     .623      .752
Sex        .349       .317       .233    .257     .158      .241
Race      -.104      -.178      -.056   -.294    -.108     -.359
1974       .020       .184       .031    .093     .038      .017
1975      -.190       .069       .061    .229     .045      .056

Standard Errors
Area       .053       .035       .078    .038     .031      .045
Sex        .059       .038       .094    .047     .040      .057
Race       .062       .039       .092    .070     .055      .084
1974       .051       .030       .074    .039     .032      .042
1975       .051       .030       .082    .041     .033      .048

Interactions which surfaced in the previous
investigation were confirmed in this model framework. The
(sex x area) interaction was significant (α=.05) with a Qc =
11.33 (d.f.=5). Similarly, the (race x area) interaction
was significant (Qc = 12.20, d.f.=5) as was the (time x
area) interaction (Qc = 66.65, d.f.=10). Modeling efforts
continued in an attempt to describe the nature of these
interactions by working within the area module structure.
The simplification of equality of reference parameters for
California I and California II was first tested, and
resulted in a Wald statistic of QC=5.46 (1 d.f.). This
reduction is thus contradicted, signifying that the two
Californias cannot share a reference parameter. Attention
then focused on assessing whether certain increments for
time which appeared to be marginal from their size and
standard error were indeed null. A test for whether the
increment for 1974 for Charlotte, and both the increments
for 1974 and 1975 for NYC and both Californias were
simultaneously null was performed. This 7 d.f. test
resulted in a test statistic of Qc=4.35, which is supportive
of the simplification.
The contrast tests then focused on simplifications for
the effects pertaining to race and sex. The simplification
of equality of sex increments for the Californias was not
contradicted, as the corresponding test resulted in a Qw of
1.41 (d.f.=1). A subsequent reduction that the sex effects
for the remaining areas be combined into one was also not
contradicted, as the 3 d.f. test resulted in a Qw of 2.26.
In addition, subsequent testing indicated that the race
effects for NYC and California I are effectively null, and
that equivalence of race effects for Charlotte and
Birmingham, and for Utah and California II, is viable. All tests
regarding the parameter estimates produced for model Xl are
summarized in Table 5.4.4.
These hypotheses imply that further model reduction is
appropriate in the analysis of mean colds for 1973, 1974,
and 1975. However, it should be noted that to go from the
preceding hypothesis tests to a model incorporating all of
the results is an aggressive approach in comparison to the
analysis performed for 2+ colds. In that situation, several

Table 5.4.4
Hypothesis Tests Pertaining to the Parameters
For the Area Module Model X1

    Hypothesis                                   QC    D.F.  P-value

1.  1974 time effect for Charlotte,             4.35    7     .739
    NYC, California I and II, and 1975
    effects for NYC, California I and II
    are null
2.  Reference parameters for California I       5.46    1     .019
    and California II are equivalent
3.  Sex effects for California I and            1.41    1     .234
    California II are equivalent
4.  Sex effects for Charlotte, Birmingham,      2.26    3     .519
    NYC, and Utah are equivalent
5.  Race effects for NYC and California I       4.25    2     .119
    are null
6.  Race effects for Charlotte and              1.04    1     .307
    Birmingham are equivalent
7.  Race effects for Utah and                    .36    1     .550
    California II are equivalent

stages of hypothesis-testing and model-fitting were employed
in order to come to a satisfactory descriptive model.
Fitting the model implied by the aggregate of the hypothesis
tests in Table 5.4.4 will not necessarily result in an
adequate goodness-of-fit. The change in the goodness-of-fit
statistic is a function of the test statistic for the
simultaneous test of the non-significant hypotheses in Table
5.4.4, and can't be determined on the basis of the
individual test results.
A 15 parameter model was fitted to the data. Table
5.4.5 contains the specification matrix, the estimated
parameters and their corresponding standard errors. The
model is still in area module form, although the modules for
NYC, California I and California II consist of reference
parameters only. The modules for Birmingham and Utah
include reference parameters as well as incremental effects
Table 5.'+.5 ....
Specification Matrix for Model X2 forMean Colds in 1973, 1974 and 1975
SPECIFICATION MATRIX Xz100 0 0 0 0 0 000 0 0001 0 0 0 0 0 0 000 0 0 0 0 01 1 0 0 0 0 0 0 0 0 0 0 0 0 0100 000 0 0 000 0 0 1 01 000 0 0 0 0 000 0 0 1 01 1 0 0 0 0 0 0 0 0 000 1 01 000 0 0 0 0 000 1 000100 0 0 0 0 0 000 1 0001 1 000 0 0 0 000 1 0001 000 0 0 0 000 0 1 0 1 01 0 0 0 0 0 0 0 000 1 0 1 01 1 000 0 0 000 0 1 0 1 0o 0 100 0 0 000 000 0 0001 1 0 0 0 0 0 0 0 000 0o 0 1 0 1 0 0 0 0 0 0 0 0 0 0o 0 1 0 0 0 0 0 0 0 0 0 0 1 0001 100 0 0 0 0 0 0 0 1 0o 0 101 0 0 000 0 0 0 1 0o 0 100 0 0 000 0 100 0001 1 0 0 0 0 000 1 000o 0 1 0 1 0 0 0 000 1 000o 0 100 0 0 0 000 1 0 1 0001 1 0 0 0 0 000 1 0 1 0o 0 1 0 1 0 0 0 000 1 0 1 000000 1 0 0 0 0 0 0 0 0 000000 1 0 000 0 0 0 0 0o 000 0 1 0 000 0 0 0 0 0o 000 0 1 0 0 0 0 0 0 0 0 0o 0 0 0 0 1 0 000 0 0 0 0 0o 0 0 0 0 1 0 0 0 0 0 0 000o 0 0 0 0 1 0 000 0 1 000o 0 0 0 0 1 0 000 0 100 0o 0 0 0 0 1 0 000 0 1 000o 0 000 1 0 0 0 0 0 1 000o 0 000 1 0 0 0 0 0 1 000o 0 0 0 0 1 000 0 0 100 0o 000 0 a 1 0 0 0 0 0 0 0 0o 0 0 000 1 100 0 000 0o 000 0 0 1 0 1 0 000 0 0o 0 0 0 0 0 1 000 0 000 1o 000 0 0 1 100 0 0 001a 0 0 000 1 0 1 0 0 0 0 0 1o 0 0 0 0 0 1 000 0 1 000o 0 0 0 001 100 0 100 0o a 0 0 0 0 1 0 100 1 000o 0 0 0 0 0 1 0 0 0 0 1 0 0 1-o 0 0 0 0 a 1 100 0 100 1o 0 0 0 0 0 1 0 100 1 001o 0 0 0 0 0 0 001 0 000 0o 0 0 0 000 0 0 1 0 0 0 0 0o 000 0 000 a 1 0 0 0 0 0o a 0 0 0 0 0 001 000 0 0000 0 a 0 a 001 0 0 000o 0 0 0 0 0 0 0 0 1 0 0 000o 0 0 0 0 0 0 001 0 0 100o 0 0 0 0 0 0 0 0 100 100a 0 a 0 a 0 0 0 a 100 1 0 0o a 0 a 0 0 a 0 0 100 100o a 0 0 0 a 0 0 0 1 0 0 100o a 0 0 0 0 0 0 0 1 0 0 100o 0 0 a 0 0 0 000 1 0 0 0 0a 0 0 0 0 0 0 000 1 000 0o 0 0 0 0 0 0 000 1 0 a 0 0000 0 0 0 0 000 1 000 1o 0 000 0 0 0 0 0 100 0 1o 0 000 0 0 000 100 0 100000 0 0 0 0 0 1 0 100o 0 0 0 0 0 0 0 0 0 1 0 100o 0 0 0 0 0 0 0 0 0 1 0 100o 0 0 0 0 0 0 0 0 0 1 0 10100000 0 0 0 0 a 1 0 101o 0 0 0 0 0 0 000 1 0 101
169
170
Table 5.4.5b
Parameter Estimates and Standard Errors for Model X2 for Mean Colds in 1973, 1974 and 1975

Parameter   Interpretation                                      Estimate ± Standard Error
β1          Predicted value for Charlotte                        .774 ± .034
β2          Incremental value for 1975 for Charlotte             .121 ± .044
β3          Predicted value for Birmingham                       .612 ± .031
β4          Incremental value for 1974 for Birmingham            .185 ± .030
β5          Incremental value for 1975 for Birmingham            .069 ± .030
β6          Predicted value for NYC                              .475 ± .044
β7          Predicted value for Utah                             .664 ± .034
β8          Incremental value for 1974 for Utah                  .090 ± .039
β9          Incremental value for 1975 for Utah                  .226 ± .041
β10         Predicted value for California I                     .624 ± .024
β11         Predicted value for California II                    .796 ± .032
β12         Incremental value for females (except Cal I & II)    .300 ± .025
β13         Incremental value for females in Cal I & II          .178 ± .032
β14         Incremental value for 'other' in Char & Birm        -.165 ± .033
β15         Incremental value for 'other' in Utah & Cal II      -.313 ± .053

Qw = 70.25    d.f. = 57    p-value = .11
for years 1974 and 1975. The module for Charlotte consists
of a reference parameter and an incremental effect for 1975.
There are four remaining parameters. These include an
overall parameter for the increment for females for the
California areas, and a similar parameter for all the other
areas. There are also two overall parameters for race. One
is an effect for Charlotte and Birmingham. The other is an
effect for Utah and California II. The Wald statistic for
the goodness-of-fit of the model is QW = 70.25, with 57 d.f.
The associated p-value is .11, indicating that the model is
marginally appropriate.
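The p-value quoted for the Wald goodness-of-fit statistic can be checked against the chi-square reference distribution. The following is a minimal sketch in Python using only the standard library; the helper name `chi2_sf` is hypothetical, and the upper-tail probability is evaluated through a series for the regularized incomplete gamma function rather than through any particular statistical package.

```python
import math

def chi2_sf(q, df):
    """Upper-tail probability P(X > q) for a chi-square variable with df d.f."""
    a, x = df / 2.0, q / 2.0
    if x <= 0:
        return 1.0
    # Series for the regularized lower incomplete gamma function P(a, x):
    # P(a, x) = exp(-x) * x**a / Gamma(a) * sum_n x**n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Goodness of fit of model X2 for mean colds: Qw = 70.25 with 57 d.f.
p_value = chi2_sf(70.25, 57)
print(round(p_value, 2))
```

The same helper applies to any of the Wald statistics reported in this chapter and the next, since each is referred to a chi-square distribution with the stated degrees of freedom.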
So, although the goodness-of-fit for this model may be
considered marginal, at least in comparison with the
criteria for other models discussed in this chapter, the
model does render an adequate description of the 72
estimates of mean colds in terms of 15 parameters. Since
the point of all these analyses is the description of the
CHESS data rather than inference and decision-making, one
has greater freedom in the model-fitting process in terms of
criteria and appropriateness than in a strict inference
setting. What this model means is that mean colds can be
explained with varying number of parameters, depending on
the area. Mean colds in NYC can be explained with two, a
reference parameter of .475, and an increment for females of
.300. Thus, the reference parameter reflects the mean colds
for all races, as well as for all three years. A look at
the model structure for Utah reveals the most complicated
parameterization. The reference parameter estimate of .664
is the predicted value for white males in 1973. There are
additional Utah-specific increments for the years 1974 and
1975. There are also incremental effects for females and
'Other' races.
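The way the model builds a predicted value from a reference parameter plus the applicable increments can be sketched as follows. This is an illustration only: the estimates are transcribed from Table 5.4.5b, while the function name `predicted_mean_colds` and the dictionary layout are hypothetical conveniences, not part of the original analysis.

```python
# Parameter estimates transcribed from Table 5.4.5b (model X2 for mean colds).
REFERENCE = {"Charlotte": 0.774, "Birmingham": 0.612, "NYC": 0.475,
             "Utah": 0.664, "California I": 0.624, "California II": 0.796}
# Area-specific time increments (no entry means no increment in the model).
YEAR_EFFECT = {("Charlotte", 1975): 0.121,
               ("Birmingham", 1974): 0.185, ("Birmingham", 1975): 0.069,
               ("Utah", 1974): 0.090, ("Utah", 1975): 0.226}
FEMALE_CALIFORNIA = 0.178     # increment for females in California I and II
FEMALE_OTHER_AREAS = 0.300    # increment for females in all other areas
# Increment for 'other' race; absent for NYC and California I.
RACE_OTHER = {"Charlotte": -0.165, "Birmingham": -0.165,
              "Utah": -0.313, "California II": -0.313}

def predicted_mean_colds(area, sex, race, year):
    """Sum the reference parameter and every increment that applies."""
    value = REFERENCE[area] + YEAR_EFFECT.get((area, year), 0.0)
    if sex == "F":
        if area.startswith("California"):
            value += FEMALE_CALIFORNIA
        else:
            value += FEMALE_OTHER_AREAS
    if race == "Other":
        value += RACE_OTHER.get(area, 0.0)
    return value

# NYC uses two parameters; Utah uses up to five.
print(predicted_mean_colds("NYC", "F", "White", 1974))
print(predicted_mean_colds("Utah", "F", "Other", 1975))
```

For NYC the prediction depends on sex alone, whereas a Utah prediction can combine the reference value with a year increment, the female increment, and the race increment.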
Thus, this model accounts for area and time
interaction by including area-specific effects for time for
Charlotte, Birmingham, and Utah. Sex and race are accounted
for with two overall incremental effects each. The
increment for females in non-California areas is fairly substantial
(.300), when one notes that the reference values range from
.475 to .796. The corresponding increment for females in
either California area is roughly half as much. Both race
effects can be considered 'decremental' in that they are
negative. One final note is that it might be possible to
smooth some of the area reference parameters together,
particularly those for Birmingham and California I.
However, the objective of modelling is not necessarily to
seek out the most succinct structure which can be allowed by
the least stringent goodness-of-fit criterion. Instead, a
parallel consideration should be to find a model which makes
sense from a substantive point of view and exhibits
structural clarity. There is often a point where further
parameter reductions induce complication rather than
clarity.
CHAPTER VI
ANALYSIS OF INCOMPLETE DATA
6.1 Introduction
The previous chapter was concerned with the analysis
of the complete data, defined as those observations which
had information for each of the three years studied. After
variable selection procedures were performed to determine
the subpopulations which exhibited the most variation for
the outcome measures 1+ asthma and 2+ colds for 1973, 1974,
and 1975, linear model analyses were performed to determine
the relationship of these response measures to time and the
selected independent variables. Mean colds in 1973, 1974,
and 1975 were also analysed.
By including incomplete observations into subsequent
analyses, one is able to increase sample sizes greatly.
Since there are 9806 observations with data for 1973, 10762
observations with data for 1974, and 11560 observations with
data for 1975, the sample sizes for estimates derived from
these subsets are at least doubled when compared to those of
the complete data. One way to utilize these additional data
is to pursue separate univariate analyses for the three
years. Section 6.2 is concerned with developing linear
models to describe the variation among sex and area
subpopulations for 1+ asthma for the years 1973, 1974, and
1975 considered separately.
The other way to include all possible data in analyses
is to apply missing data adjustments in conducting
multivariate analyses. The remainder of this chapter
addresses the use of missing data strategies discussed in
Section 3.5 in the multivariate analyses of selected
response measures. The technique of supplemental margins is
applied in the analysis of the proportion of subjects
reporting 1 or more colds in Section 6.3. Multivariate
ratio estimation is used for the same (area x sex) framework
in order to compare the two strategies. Section 6.4 is
concerned with the analysis of 1+ asthma for the years 1973,
1974, and 1975 through multivariate ratio estimation as the
method of adjusting for missing data. Mean colds in 1973,
1974 and 1975 are the response functions analysed with
multivariate ratio estimation in section 6.5. A discussion
of the relative difficulties and merits of the two different
missing data strategies concludes Chapter VI.
6.2 Univariate Analysis of 1+ Asthma in 1973, 1974, and 1975
Table 6.2.1 contains the estimates of 1+ asthma for
the years 1973, 1974, and 1975. These estimates are based
on those subjects who had valid data for those years,
regardless of which data pattern group they match, i.e.,
singles, doubles, or triples. Thus, some of the
observations on which the estimates for 1973 are based may
not have values for 1974 or 1975, and may be considered
incomplete data vectors. Weighted least squares analyses
Table 6.2.1
Estimates of 1+ Asthma for 1973, 1974 and 1975 and Standard Errors by Sex and Area

Area            Sex   Estimate   Standard Error

1973 (n=9806)
Charlotte       M     .0556      .0090
Birmingham      M     .0663      .0063
NYC             M     .0539      .0013
Utah            M     .0499      .0077
California I    M     .0886      .0089
California II   M     .0932      .0108
Charlotte       F     .0432      .0078
Birmingham      F     .0496      .0057
NYC             F     .0493      .0012
Utah            F     .0444      .0072
California I    F     .0534      .0074
California II   F     .0492      .0086

1974 (n=10762)
Charlotte       M     .0600      .0094
Birmingham      M     .0567      .0052
NYC             M     .0567      .0099
Utah            M     .0506      .0078
California I    M     .0745      .0086
California II   M     .0101      .0116
Charlotte       F     .0480      .0084
Birmingham      F     .0471      .0049
NYC             F     .0441      .0092
Utah            F     .0338      .0065
California I    F     .0554      .0078
California II   F     .0580      .0098

1975 (n=11560)
Charlotte       M     .0734      .0089
Birmingham      M     .0753      .0059
NYC             M     .0588      .0104
Utah            M     .0628      .0860
California I    M     .0110      .0996
California II   M     .0110      .0117
Charlotte       F     .0609      .0081
Birmingham      F     .0682      .0057
NYC             F     .0495      .0101
Utah            F     .0526      .0081
California I    F     .0701      .0082
California II   F     .0585      .0096
were carried out for each year as though the data for each
year constituted a separate dataset. The analysis for 1973
is based on 9806 observations, the 1974 analysis is based on
10,762 observations and the analysis for 1975 is based on
11,560 observations. Some observations are included in just
one year's analysis, some in two years, and still others in
all three years (those which were included in the 'complete
data' analysed in Chapter V.)
First, consider the analysis for the 1973 data. Let

$$y_{il} = \begin{cases} 1 & \text{if asthma is reported in 1973} \\ 0 & \text{otherwise} \end{cases}$$

Let $\bar{y} = (\bar{y}_1', \bar{y}_2', \ldots, \bar{y}_s')'$ denote the compound vector of sample
means for the s subpopulations (i=1,2,...,12) created by
cross-classifying area and sex. Thus, $\bar{y}_1$ represents the
estimated proportion of Charlotte males reporting asthma in
1973, and $\bar{y}_2$ represents the proportion for Charlotte
females. Let $\mu = (\mu_1, \mu_2, \ldots, \mu_s)'$ denote the expected value of
$\bar{y}$. A linear model for $\mu$ can then be expressed as

$$E\{\bar{y}\} = \mu = X\beta$$

where X is the known (u x t) model specification matrix with
full rank t and $\beta$ is the (t x 1) vector of unknown
parameters. The specification of $V_{\bar{y}}$, the covariance matrix
of $\bar{y}$, is found in expressions (3.2.31) and (3.2.29) in
Chapter III. Weighted least squares estimation is
appropriate to obtain an estimate for $\beta$ as

$$b = (X' V_{\bar{y}}^{-1} X)^{-1} X' V_{\bar{y}}^{-1} \bar{y}$$
By asymptotic theory arguments, b has an approximate
multivariate normal distribution with

$$E_A\{b\} = \beta, \quad \text{and} \quad \mathrm{Var}_A\{b\} = (X' V_{\bar{y}}^{-1} X)^{-1}$$
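A weighted least squares fit of this kind can be sketched in a few lines. The code below is a minimal illustration in Python, not the program used for the dissertation. It assumes that the subpopulation estimates are independent, so that the covariance matrix is diagonal with the squared standard errors on the diagonal; the function names `solve` and `wls` and the numerical values in the example are hypothetical. It also returns the residual Wald goodness-of-fit statistic in its usual quadratic form.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def wls(X, y, se):
    """Return b = (X'V^-1 X)^-1 X'V^-1 y and Qw, assuming V = diag(se**2)."""
    w = [1.0 / s ** 2 for s in se]
    rows, t = len(X), len(X[0])
    xtwx = [[sum(w[i] * X[i][j] * X[i][k] for i in range(rows))
             for k in range(t)] for j in range(t)]
    xtwy = [sum(w[i] * X[i][j] * y[i] for i in range(rows)) for j in range(t)]
    b = solve(xtwx, xtwy)
    # Residual Wald statistic: (y - Xb)' V^-1 (y - Xb).
    q_w = sum(w[i] * (y[i] - sum(X[i][j] * b[j] for j in range(t))) ** 2
              for i in range(rows))
    return b, q_w

# Hypothetical example: two areas by two sexes, main effects model.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0.060, 0.045, 0.070, 0.050]        # illustrative proportions only
se = [0.009, 0.008, 0.006, 0.006]       # illustrative standard errors only
b, q_w = wls(X, y, se)
print(b, q_w)
```

With a diagonal covariance matrix the estimator reduces to ordinary weighted regression, which is why the intercept-only case returns the precision-weighted mean of the estimates.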
A cell mean model was first used to gain a preliminary
assessment of the variation among the estimates. This model
is expressed as
$$E\{\bar{y}\} = X\beta = I\beta = \beta$$

Linear hypotheses of the form $C\beta = 0$ were then used to
investigate variation, in particular whether there were sex
differences, area differences, or (sex x area) interactions
present for the 1973 estimates of the proportions reporting
asthma. Table 6.2.2 contains the C matrices used in these
tests, as well as the resulting test statistics
$$Q_C = b'C'(C V_b C')^{-1} C b .$$
Similar strategies were applied to the 1974 and 1975
estimates, and those results are also presented in Table
6.2.2. The outcomes of these tests are similar for the
three analyses. There are significant sex and area
differences for all three. However, the 1973 and 1974 tests
for (sex x area) interaction are clearly non-significant,
while the corresponding test for 1975 with Qc = 10.74 (5
d.f.) has a p-value of .0568.
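The contrast tests described above can be illustrated with a short sketch. The Python code below computes the Wald statistic for a contrast matrix C applied to an estimate vector b with covariance matrix V. The function names and the four-cell example (two areas, males listed first, equal variances) are hypothetical and are chosen so that the arithmetic can be followed by hand; the numbers are not taken from the CHESS data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def wald_statistic(C, b, V):
    """Q_C = (Cb)' (C V C')^-1 (Cb); its d.f. equal the number of rows of C."""
    n = len(b)
    cb = [sum(row[j] * b[j] for j in range(n)) for row in C]
    cv = [[sum(row[j] * V[j][k] for j in range(n)) for k in range(n)]
          for row in C]
    cvc = [[sum(cv_row[k] * other[k] for k in range(n)) for other in C]
           for cv_row in cv]
    z = solve(cvc, cb)          # z = (C V C')^-1 (Cb)
    return sum(u * v for u, v in zip(cb, z))

# Hypothetical cell means: area A males, area B males, area A females, area B females.
b = [0.06, 0.07, 0.04, 0.05]
V = [[1e-4 if i == j else 0.0 for j in range(4)] for i in range(4)]
sex_contrast = [[1, 1, -1, -1]]
interaction_contrast = [[1, -1, -1, 1]]
print(wald_statistic(sex_contrast, b, V))
print(wald_statistic(interaction_contrast, b, V))
```

Under the cell mean model the estimate vector is the vector of observed proportions itself, so the covariance matrix of b is simply the covariance matrix of the estimates.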
Accordingly, reduced models with main effects for area
and sex were fitted for the 1973 and 1974 estimates; their
form was

$$E\{\bar{y}\} = X_2 \beta_2$$

where

X2 =
1 0 0 0 0 0 0
1 0 0 0 0 0 1
1 1 0 0 0 0 0
1 1 0 0 0 0 1
1 0 1 0 0 0 0
1 0 1 0 0 0 1
1 0 0 1 0 0 0
1 0 0 1 0 0 1
1 0 0 0 1 0 0
1 0 0 0 1 0 1
1 0 0 0 0 1 0
1 0 0 0 0 1 1

and $\beta_2$ is a vector of unknown parameters. If the nearly
significant (sex x area) interaction for 1975 is interpreted
as a chance event, then this same model could have been
applied to 1975. However, the nature of this potential
interaction is of some descriptive interest, and so a
different model was investigated. This model has the form

X2 =
1 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 0 1 0 0
0 0 0 0 1 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0 0 0 0 1

which includes separate predicted reference parameters for
each area and incremental effects for sex within each area.
The parameter estimates for the 1973 and 1974 analyses are
displayed in Table 6.2.3. The Wald goodness-of-fit
statistic for the 1973 analysis is Qw = 7.25 (d.f.=5) which
Table 6.2.2
Results of Linear Hypotheses Concerning Estimates of 1+ Asthma in 1973, 1974 and 1975

Hypothesis                              Year    Qc      d.f.   p-value
H1: There are no sex differences        1973    13.69   1      .0002
                                        1974    14.83   1      .0001
                                        1975    18.03   1      .0000
H2: There are no area differences       1973    14.46   5      .0129
                                        1974    20.82   5      .0009
                                        1975    23.21   5      .0003
H3: There is no sex x area interaction  1973     7.25   5      .2025
                                        1974     4.16   5      .5260
                                        1975    10.74   5      .0568

Contrast Matrices For Hypothesis Tests

H1: C1 =
 1  1  1  1  1  1 -1 -1 -1 -1 -1 -1

H2: C2 =
 1  0  0  0  0 -1  1  0  0  0  0 -1
 0  1  0  0  0 -1  0  1  0  0  0 -1
 0  0  1  0  0 -1  0  0  1  0  0 -1
 0  0  0  1  0 -1  0  0  0  1  0 -1
 0  0  0  0  1 -1  0  0  0  0  1 -1

H3: C3 =
 1  0  0  0  0 -1 -1  0  0  0  0  1
 0  1  0  0  0 -1  0 -1  0  0  0  1
 0  0  1  0  0 -1  0  0 -1  0  0  1
 0  0  0  1  0 -1  0  0  0 -1  0  1
 0  0  0  0  1 -1  0  0  0  0 -1  1
Table 6.2.3
Specification Matrix, Estimated Parameters, Standard Errors and Predicted Values for Model X2 for the 1973 and 1974 Analyses

Specification Matrix X2
1 0 0 0 0 0 0
1 0 0 0 0 0 1
1 1 0 0 0 0 0
1 1 0 0 0 0 1
1 0 1 0 0 0 0
1 0 1 0 0 0 1
1 0 0 1 0 0 0
1 0 0 1 0 0 1
1 0 0 0 1 0 0
1 0 0 0 1 0 1
1 0 0 0 0 1 0
1 0 0 0 0 1 1

                                                  1973 Estimates and   1974 Estimates and
Parameter   Interpretation                        Standard Errors      Standard Errors
β1          Predicted value for Charlotte           .0592 ± .0065        .0620 ± .0067
β2          Incremental effect for Birmingham       .0082 ± .0073       -.0021 ± .0072
β3          Incremental effect for NYC              .0022 ± .0108       -.0037 ± .0092
β4          Incremental effect for Utah            -.0022 ± .0079       -.0012 ± .0080
β5          Incremental effect for California I     .0186 ± .0082        .0105 ± .0085
β6          Incremental effect for California II    .0186 ± .0090        .0229 ± .0098
β7          Incremental effect for females         -.0188 ± .0047       -.0156 ± .0044
has a p-value of .20 and thus is supportive of the model.
The Wald statistic for the same model for 1974 is Qw =
4.16 (d.f.=5), which is also non-significant. No further
model reduction was undertaken for these two analyses.
Predicted values are displayed in Table 6.2.5.
The model X2 for the 1975 data is saturated, and
accordingly there is no goodness-of-fit statistic defined.
Two linear hypotheses concerning the model parameters were
tested. H1 investigates whether the sex effects for
Charlotte, Birmingham, NYC, and Utah are null. The
hypothesis is stated

$$H_1: \beta_7 = \beta_8 = \beta_9 = \beta_{10} = 0$$

and the resulting test statistic is Qc = 3.00 (d.f.=4),
which is non-significant. The other hypothesis investigated
was whether the area parameters for California I and
California II were equivalent, as well as whether their sex
effects were the same. This hypothesis can be stated:

$$H_2: \beta_5 = \beta_6, \quad \beta_{11} = \beta_{12}$$

and was also non-significant.
The implied six parameter model was then fit to the
1975 estimates. This model can be stated:

$$E\{\bar{y}\} = X_3 \beta_3$$

Table 6.2.4 contains the specification matrix for the model
X3 as well as the estimated parameters and standard errors.
Predicted values are contained in Table 6.2.5. The
goodness-of-fit statistic is Qw = 3.84 (d.f.=6), with a
p-value of .70.
Table 6.2.4
Specification Matrix, Estimated Parameters, Standard Errors and Predicted Values for Model X3 for the 1975 Analysis

Specification Matrix X3
1 0 0 0 0 0
1 0 0 0 0 0
0 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 1 1
0 0 0 0 1 0
0 0 0 0 1 1

Parameter   Interpretation                              Estimate ± Standard Error
β1          Predicted value for Charlotte                 .0666 ± .0060
β2          Predicted value for Birmingham                .0717 ± .0041
β3          Predicted value for NYC                       .0540 ± .0072
β4          Predicted value for Utah                      .0574 ± .0059
β5          Predicted value for California I and II       .0110 ± .0074
β6          Incremental effect for females               -.0450 ± .0097

Qw = 3.84    d.f. = 6    p-value = .699
Table 6.2.5
Predicted Values and Standard Errors for 1+ Asthma in 1973, 1974 and 1975 from Univariate Analyses For each of the Three Years
6.3 Supplemental Margins in the Analysis of the Proportion
of Colds Reported in 1973, 1974, and 1975
The use of supplemental margins in the analysis of the
CHESS data was first illustrated in Chapter III, as an
example in the section pertaining to supplemental margins
methodology. The primary objective of that analysis was to
estimate the probability of reporting colds in 1973, 1974,
and 1975 ($\pi_{73}$, $\pi_{74}$, $\pi_{75}$) for the study population restricted
to Birmingham females. The proportion vector included
components representing the different response profiles for
the triple year, double years, and single years sources of
data (seven in all). Application of the appropriate A
matrix formed marginal probabilities of reporting colds in
1973, 1974, and 1975 whenever they existed for each of the
subpopulations representing a different data source. These
functions were then modeled with the structure
(6.3.1)    $$E\{F(p)\} = X\pi$$

where $\pi = (\pi_{73}, \pi_{74}, \pi_{75})'$ and

(6.3.2)    X =
1 0 0
0 1 0
0 0 1
1 0 0
0 1 0
1 0 0
0 0 1
0 1 0
0 0 1
1 0 0
0 1 0
0 0 1
The same analysis was repeated for each of the (area x
sex) combinations, resulting in estimates of $\pi_{73}$, $\pi_{74}$, $\pi_{75}$ for
the twelve demographic subpopulations. An alternative
strategy used might have been to combine all (area x sex)
subpopulations together and perform an overall analysis.
One potential problem with this might be that the assumption
that the parameters $\pi_{73}$, $\pi_{74}$, $\pi_{75}$ represent the probability
of reporting a cold in 1973, 1974, and 1975 respectively for
each of the constituent data pattern subpopulations may not
hold for each of the (area x sex) subpopulations. By
processing one (sex x area) subpopulation at a time, one can
evaluate this assumption individually by evaluating
goodness-of-fit statistics for the model X of (6.3.2) and
let the results identify issues to be addressed with
subsequent analysis efforts. Another potential problem is
mechanical. A great many functions would need to be created
for an overall analysis, i.e. the (12 x 26) A matrix applied
to the (26 x 1) proportion vector PG for Birmingham females
would have to be one of 12 blocks in a (144 x 312)
transformation matrix A applied to a (312 x 1) proportion
vector PG. Such a proportion vector and its (312 x 312)
covariance matrix are beyond the capacity of many current
computer programs.
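The dimensions cited above follow from counting the data sources and their response profiles, and the same enumeration reproduces the rows of the specification matrix X of (6.3.2). The sketch below is an illustration in Python under one stated assumption: the response in each observed year is dichotomous (colds reported or not), so that a source observed in k years has 2^k response profiles. The variable names are hypothetical.

```python
from itertools import combinations

YEARS = (1973, 1974, 1975)

# Every non-empty subset of years is a data source: triples, doubles, singles.
patterns = [combo for size in (3, 2, 1) for combo in combinations(YEARS, size)]

# Assumption: a yes/no response per observed year, so a source observed in
# k years has 2**k response profiles and contributes k marginal functions.
n_profiles = sum(2 ** len(p) for p in patterns)
n_functions = sum(len(p) for p in patterns)

# Rows of X in (6.3.2): one row per marginal function, with a 1 in the
# column of the year whose probability that function estimates.
X = [[1 if year == observed else 0 for year in YEARS]
     for p in patterns for observed in p]

n_subpops = 12  # six areas by two sexes
block_shape = (n_functions, n_profiles)
overall_shape = (n_subpops * n_functions, n_subpops * n_profiles)

print(len(patterns), block_shape, overall_shape)
for row in X:
    print(row)
```

Counting in this way makes the mechanical difficulty concrete: each additional subpopulation adds a full block to the transformation matrix, and the covariance matrix grows with the square of the length of the proportion vector.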
The estimates resulting from these analyses are
reported in Table 6.3.1, along with their standard errors
and their goodness-of-fit statistics. While there is no
apparent trend over time, females would appear to report
consistently higher proportions of colds than males. The
goodness-of-fit statistics suggest that the assumption of
Table 6.3.1
Estimates of π73, π74, π75 by Area and Sex and Associated Goodness-of-Fit Statistics

Area            Sex      π73    π74    π75    n      Qw
Charlotte       Male     .552   .514   .460   1468   15.12*
Charlotte       Female   .661   .674   .640   1437   15.18*
Birmingham      Male     .439   .512   .469   3721   24.27**
Birmingham      Female   .599   .635   .574   3374    6.25
NYC             Male     .415   .404   .371   1001    6.03
NYC             Female   .489   .457   .521    932   12.72
Utah            Male     .461   .528   .528   1463    4.90
Utah            Female   .595   .611   .648   1411   21.46**
California I    Male     .427   .471   .471   1650   11.65
California I    Female   .519   .531   .548   1561    8.71
California II   Male     .473   .478   .493   1276   17.47**
California II   Female   .582   .590   .548   1047    9.29

Note: Qw has 9 d.f.; '*' indicates significance at α=.10, while '**' indicates significance at α=.05.
similar parameters $\pi_{73}$, $\pi_{74}$, $\pi_{75}$ across the component groups
of data sources does not hold for all (area x sex)
subpopulations. While seven subpopulations have entirely
adequate goodness-of-fit statistics, five are questionable,
according to an (α=.10) significance level criterion.
Birmingham males have an especially poor fit, with a Qw of
24.27 and an accompanying p-value of .004.
If the goodness-of-fit tests were all non-significant,
one could immediately proceed to a second modeling phase in
which the estimates of the parameter vector
$\pi' = (\pi_1', \pi_2', \ldots, \pi_s')$ themselves become the functions of
interest, where $\pi_i = (\pi_{73}, \pi_{74}, \pi_{75})$ for i=1,2,...,12 for the
twelve (sex x area) subpopulations. The model

(6.3.3)    $$E\{F(\hat{\pi})\} = F(\pi) = X\tau$$
can then be fit to describe the variation across area, sex,
and time. By forcing the X structure on all the (sex x
area) subpopulations, one is essentially choosing to
generate weighted regression estimates for $\pi_{73}$, $\pi_{74}$ and $\pi_{75}$,
averaging those from the seven data source subpopulations.
The goodness-of-fit statistics Qw for the model X are thus
ignored for this objective, but do provide information on
the quality of the estimates.
The alternative strategy would be to assess the
implications of using the incomplete data as augmenting a
complete data analysis where compatibilities exist. One
would use the estimates and covariances from the incomplete
data subpopulations (i.e. double years and single years)
when they were similar to the corresponding estimates from
the complete data. One is thus extending the sample size
for the analysis beyond that of the complete data vectors to
gain better precision. However, the assumption that the
complete data estimates are the 'true' estimates may be at
best a leap of faith and may lead to biased results.
The forced similarity strategy was applied to the
estimated parameters of Table 6.3.1. The preliminary model
$$E\{F(\pi)\} = X\tau = \tau$$

was first fit to the estimates, where X is the identity
matrix, and $F(\pi)$ is the (36 x 1) vector consisting of the
parameter estimates $\hat{\pi}_{73}$, $\hat{\pi}_{74}$, and $\hat{\pi}_{75}$ for males and females
in Charlotte, Birmingham, NYC, Utah, California I, and
California II respectively. Hypotheses of the form
$H_0: C\tau = 0$ were evaluated to gain a preliminary assessment of the
sources of variation among the estimates. The resulting
test statistics are displayed in Table 6.3.2. The tests for
sex and area are highly significant, while the test for time
is borderline at the α=.05 level of significance. However,
the interaction of time with area is highly
significant (Qc=47.70, d.f.=10). These findings agree with
those of previous analyses. Unlike most of the findings in
the complete data analyses, there is a significant three-way
interaction.
Due to the existence of the three-way interaction, the
decision was made to continue with separate analyses for
females and males. Additional modeling is for the purpose
of seeking appropriate descriptions of the interactions in
the data. Steps are taken to eliminate extraneous sources
of variation so as to find settings (subpopulations x times)
with similar predicted values. The models by which these
are "clustered" are specified and evaluated. These clusters
are of descriptive interest in clarifying the nature of the
(area x sex x time) interaction. The function vector of
probability estimates for males was first constructed by
piecing together ($\hat{\pi}_{73}$, $\hat{\pi}_{74}$, $\hat{\pi}_{75}$) for the males in Charlotte,
Birmingham, NYC, Utah, California I, and California II to
form an (18 x 1) function vector. The appropriate variances
and covariances were also joined together to create the
appropriate (18 x 18) covariance matrix. A model with the
structure
(6.3.4)    $$E\{F(\pi_M)\} = X\tau$$

was first fitted to the data, where $F(\pi_M)$ denotes the male
predicted probabilities and X is the (18 x 18) specification
matrix with reference parameters for each area for 1973 and
incremental effects for 1974 and 1975 within each area. The
model is saturated, and accordingly there is no goodness-of-
fit defined. The specification matrix, parameter estimates
and their standard errors are displayed in Table 6.3.3.
Parameters $\tau_1$-$\tau_6$ represent predicted reference values for
the probability of reporting colds during 1973 in the areas
Charlotte, Birmingham, NYC, Utah, California I and
California II, respectively; $\tau_7$-$\tau_{12}$ denote incremental
Table 6.3.2
Results of Linear Hypotheses Concerning Estimates of the Probability of Reporting Colds in 1973, 1974, and 1975

Hypothesis                                                        Qc       d.f.   p-value
1. No differences between sexes for averages over area x time     275.19   1      .0000
2. No variation among areas for averages over sex x time          160.33   5      .0000
3. No variation among time for averages over sex x area             5.65   2      .0544
4. Homogeneity across time for differences between sexes for
   averages across areas (i.e. no time x sex interaction)           1.77   2      .4129
5. Homogeneity across areas for differences between sexes for
   averages across time                                            16.53   5      .0055
6. Homogeneity across area for differences among times for
   averages across sex                                             47.70   10     .0000
7. No average area x sex x time interaction                        21.86   10     .0158
Table 6.3.3
Specification Matrix, Parameter Estimates, and Standard Errors for Intermediate Model X for Males

Specification Matrix (rows as printed, three per area)
1 0 0
1 1 0
1 0 1

1 0 0
1 1 0
1 0 1

1 0 0
1 1 0
1 0 1

1 0 0
1 1 0
1 0 1

1 0 0
1 1 0
1 0 1

1 0 0
1 1 0
1 0 1

Parameter Estimates and s.e.'s
Parameter   Males            Females
τ1           .552 (.019)      .661 (.018)
τ2           .439 (.012)      .599 (.013)
τ3           .415 (.028)      .489 (.028)
τ4           .461 (.011)      .595 (.011)
τ5           .427 (.016)      .519 (.016)
τ6           .473 (.018)      .582 (.019)
τ7          -.038 (.026)      .013 (.024)
τ8           .073 (.016)      .036 (.016)
τ9          -.011 (.034)     -.033 (.035)
τ10          .067 (.023)      .015 (.023)
τ11          .044 (.021)      .012 (.021)
τ12          .005 (.024)      .008 (.024)
τ13         -.092 (.025)     -.021 (.023)
τ14          .029 (.016)     -.025 (.016)
τ15         -.044 (.034)      .032 (.036)
τ16          .067 (.024)      .052 (.023)
τ17          .044 (.021)      .029 (.022)
τ18          .020 (.026)     -.034 (.021)
effects for the year 1974 for the areas listed above
respectively, while parameters $\tau_{13}$-$\tau_{18}$ represent the
corresponding effects for 1975.
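Because the model is saturated, its parameters are simply a re-expression of the eighteen estimates: each reference parameter equals the 1973 estimate for its area, and each increment equals the difference between a later year and 1973. The sketch below illustrates this in Python using the male estimates transcribed from Table 6.3.1; the function name `saturated_parameters` and the data layout are hypothetical.

```python
AREAS = ["Charlotte", "Birmingham", "NYC", "Utah",
         "California I", "California II"]

# Male estimates (pi73, pi74, pi75) transcribed from Table 6.3.1.
MALES = {"Charlotte": (0.552, 0.514, 0.460),
         "Birmingham": (0.439, 0.512, 0.469),
         "NYC": (0.415, 0.404, 0.371),
         "Utah": (0.461, 0.528, 0.528),
         "California I": (0.427, 0.471, 0.471),
         "California II": (0.473, 0.478, 0.493)}

def saturated_parameters(estimates):
    """Return 18 values: six 1973 references, then 1974 and 1975 increments."""
    reference = [estimates[a][0] for a in AREAS]
    inc_1974 = [estimates[a][1] - estimates[a][0] for a in AREAS]
    inc_1975 = [estimates[a][2] - estimates[a][0] for a in AREAS]
    return reference + inc_1974 + inc_1975

tau = saturated_parameters(MALES)
for index, value in enumerate(tau, start=1):
    print(f"tau{index}: {value:+.3f}")
```

Small discrepancies in the third decimal place between such differences and a printed table are to be expected, since the published estimates are themselves rounded.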
The significance of the effects for area in 1973,
time, and (time x area) were evaluated with Wald statistics
for the hypotheses H1 , H2 , and H3 respectively:
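$$H_1: \tau_1 = \tau_2 = \tau_3 = \tau_4 = \tau_5 = \tau_6$$

$$H_2: \tau_7 = \tau_8 = \tau_9 = \tau_{10} = \tau_{11} = \tau_{12} = 0, \quad \tau_{13} = \tau_{14} = \tau_{15} = \tau_{16} = \tau_{17} = \tau_{18} = 0$$

$$H_3: \tau_7 = \tau_8 = \tau_9 = \tau_{10} = \tau_{11} = \tau_{12}, \quad \tau_{13} = \tau_{14} = \tau_{15} = \tau_{16} = \tau_{17} = \tau_{18}$$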
All these tests were significant. Qc for the test for area
was 33.59 with 5 d.f. Qc for time was also significant at
55.67, with 12 d.f. and likewise there is clearly a
significant (area x time) interaction (QC=37.19, d.f.=10).
All p-values are less than .0001.
The following reduction was investigated with a test
of the form $H_0: C\tau = 0$ which was evaluated with the Wald
statistic $Q_C = \hat{\tau}'C'(C V_{\hat{\tau}} C')^{-1} C\hat{\tau}$.

$$C_1: \tau_2 = \tau_3 = \tau_5, \text{ and } \tau_4 = \tau_6$$
This simplification states that the reference parameters for
Birmingham, NYC, and California I are equivalent, and
simultaneously, that the reference parameters for Utah and
California II are equivalent. The C matrix for this
hypothesis is written
C =
0 1 0 0 -1  0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 -1  0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1  0 -1 0 0 0 0 0 0 0 0 0 0 0 0

and the resulting Qc=1.08 (d.f.=3), which is clearly
supportive. The implied fifteen parameter model,
incorporating C1 by reducing the number of area reference
parameters from 6 to 3, was then fitted to the estimates.
The model structure XR is presented in Table 6.3.4, along
with the resulting parameter vector $\tau_R$ and the corresponding
standard errors. The QW for this model has the same value
as QC for C1 .
Additional analysis efforts are directed at
ascertaining whether variation can be described more
succinctly with model simplifications. As noted in Chapter
V, such simplification is justified by the suitability of
model (6.3.4) for these data, as confirmed by its previously
noted goodness of fit. Further tests concerning model
parameters are intended to motivate subsequent parameter
smoothing, and resulting p-values are considered descriptive
guidelines for such analysis. Consideration of estimates in
Table 6.3.4 indicates that many of the time effects are
negligible. The time effects for 1974 for Charlotte, NYC,
and California II are very small, as is the 1975 effect for
California II. In addition, the 1975 effects for Birmingham
and California I appear to be similar,
Table 6.3.4
Specification Matrix, Estimated Parameters and Standard Errors for Reduced Model XR for the Probability of Reporting Colds in 1973, 1974 and 1975 for the Males Analysis

Specification Matrix XR
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 1

Parameter   Interpretation                                  Estimate ± Standard Error
τR1         Predicted value for Charlotte                     .552 ± .019
τR2         Predicted value for Birm, NYC and Cal I           .433 ± .009
τR3         Predicted value for Utah and Cal II               .467 ± .013
τR4         Incremental value for 1974 for Charlotte         -.038 ± .026
τR5         Incremental value for 1974 for Birmingham         .079 ± .014
τR6         Incremental value for 1974 for NYC               -.028 ± .022
τR7         Incremental value for 1974 for Utah               .062 ± .021
τR8         Incremental value for 1974 for California I       .039 ± .018
τR9         Incremental value for 1974 for California II      .010 ± .021
τR10        Incremental value for 1975 for Charlotte         -.092 ± .025
τR11        Incremental value for 1975 for Birmingham         .036 ± .014
τR12        Incremental value for 1975 for NYC               -.061 ± .023
τR13        Incremental value for 1975 for Utah               .062 ± .021
τR14        Incremental value for 1975 for California I       .039 ± .017
τR15        Incremental value for 1975 for California II      .026 ± .022
as do the 1975 effects for Charlotte and NYC. A model
simplification that would substantiate these points
simultaneously can be stated:

$$C_2: \tau_{R4} = \tau_{R6} = \tau_{R9} = \tau_{R15} = 0, \quad \tau_{R11} = \tau_{R14}, \quad \tau_{R10} = \tau_{R12}$$
The model implied by the constraints of C2 was fitted to the
estimates. This model can be expressed as:
$$E\{F(\pi)\} = X_F \tau_F$$

where $F(\pi)$ is the vector of six sets of the predicted
estimates $\pi_i = (\pi_{73}, \pi_{74}, \pi_{75})$ for each of the six areas
(i=1,2,...,6), and $\tau_F$ is the corresponding parameter vector.
The Qw for this 9 parameter model is QW=6.60 (d.f.=9), and
is indicative of an adequate fit. One can obtain the value
of the test statistic Qc for the simplification C2 by taking
the difference of the Wald statistics for the models XR and
XF. Hence, Qc = 6.60 - 1.08 = 5.52, the degrees of freedom = 9 - 3 = 6,
and the p-value of Qc is p=.480. Table 6.3.5 contains
the specification matrix XF, the resulting parameter
estimates and their standard errors, and also predicted
values, residuals, and p-values for z-scores constructed
from the residuals as discussed in Chapter V.
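The differencing of Wald statistics for nested models can be sketched as follows. The code is an illustration in Python, not the original computation; the helper name `chi2_sf_even` is hypothetical, and it uses the closed form of the chi-square upper tail that is available when the degrees of freedom are even, which is the case here.

```python
import math

def chi2_sf_even(q, df):
    """Upper-tail chi-square probability for even, positive degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires even, positive d.f.")
    x = q / 2.0
    # P(X > q) = exp(-q/2) * sum_{k < df/2} (q/2)**k / k!
    return math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(df // 2))

# Goodness-of-fit statistics reported for the two nested male models.
q_final, df_final = 6.60, 9        # final nine parameter model XF
q_reduced, df_reduced = 1.08, 3    # fifteen parameter model XR

q_c = q_final - q_reduced          # statistic for the simplification C2
df_c = df_final - df_reduced
p_value = chi2_sf_even(q_c, df_c)
print(round(q_c, 2), df_c, round(p_value, 3))
```

The subtraction works because XF is obtained from XR by imposing the constraints of C2, so the increase in lack of fit is attributable to those constraints alone.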
The estimates for the saturated model X for the
analysis of the females are displayed in Table 6.3.3. A
comparison of the area reference parameters shows that the
estimates for the females are consistently higher than for
males. No discernable pattern appears, however, when one
compares the parameters representing time effects. Similar
hypothesis testing and model
Table 6.3.5
Specification Matrix, Estimated Parameters and Standard Errors for Final Model XF for the Probability of Reporting Colds in 1973, 1974 and 1975 for the Males Analysis

Specification Matrix
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 0 0 0 0 1 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 1 0 0 0 0
0 0 1 0 0 0 0 0 1
0 1 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0
0 1 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0

Parameter Interpretation                          Estimates and Standard Errors
Predicted value for Charlotte                       .530 ± .013
Predicted value for Birm, NYC and Cal I             .429 ± .008
Predicted value for Utah and California II          .475 ± .010
Incremental effect for 1974 for Birmingham          .082 ± .013
Incremental effect for 1974 for Utah                .055 ± .019
Incremental effect for 1974 for Cal I               .042 ± .017
Incr. effect for 1975 for Char and NYC             -.064 ± .015
Incr. effect for 1975 for Birm and Cal I            .040 ± .011
Incremental effect for 1975 for Utah                .054 ± .020

Area            Year   Fhat   Std Err   Residual   P-value
Charlotte       1973   .530   .013       .022      .112
Charlotte       1974   .530   .013      -.016      .276
Charlotte       1975   .466   .014      -.006      .494
Birmingham      1973   .429   .008       .010      .276
Birmingham      1974   .512   .011       .001      .472
Birmingham      1975   .469   .009      -.000      .955
NYC             1973   .429   .008      -.014      .588
NYC             1974   .429   .008      -.025      .184
NYC             1975   .365   .016       .006      .671
Utah            1973   .475   .010      -.015      .314
Utah            1974   .530   .017      -.002      .314
Utah            1975   .529   .017      -.001      .314
California I    1973   .429   .008      -.002      .864
California I    1974   .471   .016       .000      .999
California I    1975   .469   .009       .002      .886
California II   1973   .475   .010      -.002      .896
California II   1974   .475   .010       .003      .875
California II   1975   .475   .010       .018      .251
simplification efforts to those employed for the males
analysis were performed for the females analysis. Table
6.3.6 contains the specification matrix and parameter
estimates for a reduced model XR for the females, and Table
6.3.7 contains the final model XF for the females, as well
as parameter estimates and standard errors. The table also
contains predicted values and residuals. The final model
has eight parameters, including three area reference
parameters, two incremental time effects for 1974 and four
incremental time effects for 1975. The goodness-of-fit
statistic for this model is Qw=7.56, with 10 d.f. Its
non-significance is supportive of the appropriateness of the
model (p-value=.67).
Thus, the forced similarity structure for the
predicted probabilities of reporting colds has been
accommodated with separate models for each sex. Smoothing is
accomplished across the area reference parameters for both
modeling efforts. Charlotte maintains its own reference
value in both cases, but various combinations of the other
parameters can be merged together. The three areas
Birmingham, NYC, and California I are put together for the
males, while the three-way smoothing for the females
consists of Birmingham, Utah, and California II. Both sexes
had significant 1974 effects for Birmingham; the males
included two other 1974 effects while the females had just
one other 1974 effect for NYC. Every area except California
II had significant time effects for 1975 for the males.
Table 6.3.6

Specification Matrix, Estimated Parameters and Standard Errors
for Reduced Model XR for the Probability of Reporting
Colds in 1973, 1974 and 1975 for the Females Analysis

Specification Matrix

    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
    1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
    0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
    0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
    0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
    0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
    0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
    0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
    0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
    0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
    0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
    0 1 0 0 0 0 0 0 0 0 0 0 0 0 1

                                                              Estimates and
Parameter   Interpretation                                    Standard Errors

R1          Predicted value for Charlotte                      .661 ± .018
R2          Predicted value for Birm, Utah and Cal II          .594 ± .009
R3          Predicted value for NYC and Cal I                  .512 ± .014
R4          Incremental value for 1974 for Charlotte           .013 ± .024
R5          Incremental value for 1974 for Birmingham          .040 ± .014
R6          Incremental value for 1974 for NYC                -.054 ± .025
R7          Incremental value for 1974 for Utah                .016 ± .019
R8          Incremental value for 1974 for California I        .019 ± .020
R9          Incremental value for 1974 for California II      -.001 ± .020
R10         Incremental value for 1975 for Charlotte          -.021 ± .023
R11         Incremental value for 1975 for Birmingham         -.020 ± .014
R12         Incremental value for 1975 for NYC                 .010 ± .027
R13         Incremental value for 1975 for Utah                .053 ± .019
R14         Incremental value for 1975 for California I        .036 ± .020
R15         Incremental value for 1975 for California II      -.045 ± .022
Table 6.3.7

Specification Matrix, Estimated Parameters and Standard Errors
for Final Model XF for the Probability of Reporting Colds
in 1973, 1974 and 1975 for the Females Analysis

Specification Matrix

    1 0 0 0 0 0 0 0
    1 0 0 0 0 0 0 0
    1 0 0 0 0 0 0 0
    0 1 0 0 0 0 0 0
    0 1 0 1 0 0 0 0
    0 1 0 0 0 1 0 0
    0 0 1 0 0 0 0 0
    0 0 1 0 1 0 0 0
    0 0 1 0 0 0 0 0
    0 1 0 0 0 0 0 0
    0 1 0 0 0 0 0 0
    0 1 0 0 0 0 1 0
    0 0 1 0 0 0 0 0
    0 0 1 0 0 0 0 0
    0 0 1 0 0 0 0 0
    0 1 0 0 0 0 0 0
    0 1 0 0 0 0 0 0
    0 1 0 0 0 0 0 1

                                                   Estimates and
Parameter Interpretation                           Standard Errors

Predicted value for Charlotte                       .657 ± .011
Predicted value for Birm, Utah and Cal II           .597 ± .008
Predicted value for NYC and California I            .527 ± .009
Incremental effect for 1974 for Birmingham          .038 ± .013
Incremental effect for 1974 for NYC                -.068 ± .023
Incremental effect for 1975 for Birmingham         -.023 ± .013
Incr. effect for 1975 for Utah                      .050 ± .018
Incr. effect for 1975 for California II            -.047 ± .021

Area            Year    Fhat   Std Err   Residual   P-value

Charlotte       1973    .657    .011      .0045      .752
Charlotte       1974    .657    .011      .0173      .226
Charlotte       1975    .657    .011     -.0162      .167
Birmingham      1973    .597    .008      .0018      .855
Birmingham      1974    .635    .011      .0002      .855
Birmingham      1975    .574    .011      .0001      .855
NYC             1973    .527    .009     -.0378      .155
NYC             1974    .458    .021     -.0018      .496
NYC             1975    .527    .009     -.0057      .784
Utah            1973    .597    .008     -.0018      .904
Utah            1974    .597    .008      .0137      .378
Utah            1975    .647    .017      .0009      .598
California I    1973    .527    .009     -.0076      .574
California I    1974    .527    .009      .0046      .739
California I    1975    .527    .009      .0211      .101
California II   1973    .597    .008     -.0148      .405
California II   1974    .597    .008     -.0070      .700
California II   1975    .550    .020     -.00167     .632

QW = 7.56     P-value = .671     d.f. = 10
Charlotte and NYC shared one parameter, Birmingham and
California I shared another, and Utah had its own 1975
effect. Birmingham, Utah and California II had individual
time effects for the model for the females. When both
models are considered together, it appears that time is a
major factor in the estimates for Birmingham, while it is
least important for California II. The residuals are
reasonable for both models as judged by their p-values, and
the predicted values demonstrate that the estimated
proportions for the females are consistently higher than
they are for the males.
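
The predicted values for the females in Table 6.3.7 are the product of the specification matrix XF and the vector of parameter estimates. The sketch below rebuilds XF from the parameter interpretations and reproduces that product; the dictionary and function names are introduced here for illustration only.

```python
# Parameter estimates for the final females model (Table 6.3.7): three area
# reference parameters, then incremental effects for 1974 Birmingham, 1974 NYC,
# 1975 Birmingham, 1975 Utah and 1975 California II.
beta = [0.657, 0.597, 0.527, 0.038, -0.068, -0.023, 0.050, -0.047]

# Column of the reference parameter shared by each area.
reference = {"Charlotte": 0, "Birmingham": 1, "NYC": 2,
             "Utah": 1, "California I": 2, "California II": 1}

# Column of the incremental time effect for the (area, year) cells that have one.
increment = {("Birmingham", 1974): 3, ("NYC", 1974): 4, ("Birmingham", 1975): 5,
             ("Utah", 1975): 6, ("California II", 1975): 7}

def design_row(area, year):
    """One row of XF: a 1 in the area reference column plus any time effect."""
    row = [0] * len(beta)
    row[reference[area]] = 1
    if (area, year) in increment:
        row[increment[(area, year)]] = 1
    return row

areas = ["Charlotte", "Birmingham", "NYC", "Utah", "California I", "California II"]
years = [1973, 1974, 1975]
cells = [(a, y) for a in areas for y in years]
X = [design_row(a, y) for a, y in cells]
predicted = [sum(x * b for x, b in zip(row, beta)) for row in X]
df = len(X) - len(beta)  # residual degrees of freedom for the goodness-of-fit test

for (area, year), value in zip(cells, predicted):
    print(f"{area:14s} {year}  {value:.3f}")
```

Products that differ from the tabled Fhat in the third decimal place reflect rounding of the published parameter estimates.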
However, yet another model refinement is possible,
this time motivated by the structure of the predicted values
resulting from the 'final' models. It would appear that the
predicted values of Table 6.3.5 for the males could be
classified into one of four (time x location) clusters,
these being centered around the values .52, .47, .42, and
.36. One can assess the appropriateness of such a
classification scheme by fitting a linear model which
characterizes the original functions correspondingly. Koch,
Johnson, and Tolley (1972) discuss such a strategy in the
analysis of survival rates. Such a structure would thus
model similarities across both time and area, providing an
appropriate descriptive tool for this data, especially in
light of the interactions present. This model can be stated
as
where XP represents the specification matrix suggested by
the predicted values from the previous model. This
specification matrix, as well as the resulting estimates of
the parameters, are displayed in Table 6.3.8a. The Wald
goodness-of-fit statistic is QW = 8.77 (d.f. = 14), which is
indicative of an adequate description of the data. Table
6.3.8b includes the analogous information when the same
strategy is applied to the females. In accordance with the
general trend that females report more colds than males, the
cluster values for females are higher than those for males
at .65, .60, .53 and .46. The goodness-of-fit statistic
for the females is also non-significant (QW = 13.30, d.f. = 14).
The precise parameter interpretations are listed at the
bottom of the table.
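
The cluster model can be illustrated with a simplified weighted least squares fit. The sketch below takes the supplemental margins estimates and standard errors for males from Table 6.3.14 and the cluster assignments from Table 6.3.8a, and treats the eighteen estimates as uncorrelated. That independence is an assumption made only to keep the example short: the analysis in the text uses the full estimated covariance matrix, so the sketch approximates the tabled estimates and does not reproduce the reported QW. The function name fit_clusters is hypothetical.

```python
# Estimates and standard errors for males, ordered by area (Charlotte,
# Birmingham, NYC, Utah, California I, California II) and by year within area.
estimates = [0.552, 0.514, 0.460, 0.439, 0.512, 0.469, 0.415, 0.404, 0.371,
             0.461, 0.529, 0.529, 0.427, 0.471, 0.471, 0.473, 0.478, 0.493]
std_errs = [0.019, 0.019, 0.017, 0.012, 0.011, 0.011, 0.028, 0.021, 0.021,
            0.017, 0.018, 0.018, 0.016, 0.016, 0.015, 0.018, 0.019, 0.018]

# Column of XP (numbered 0 to 3) holding the 1 in each row, from Table 6.3.8a.
cluster = [0, 0, 1, 2, 0, 1, 2, 2, 3, 1, 0, 0, 2, 1, 1, 1, 1, 1]

def fit_clusters(estimates, std_errs, cluster, n_clusters=4):
    """Weighted least squares for an indicator design with a diagonal covariance.

    With one indicator column per cluster and uncorrelated estimates, each
    parameter is the inverse-variance weighted mean of its cluster, and the
    Wald goodness-of-fit statistic is the weighted residual sum of squares.
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    fitted = []
    for c in range(n_clusters):
        idx = [i for i, k in enumerate(cluster) if k == c]
        w_sum = sum(weights[i] for i in idx)
        fitted.append(sum(weights[i] * estimates[i] for i in idx) / w_sum)
    q_w = sum(weights[i] * (estimates[i] - fitted[cluster[i]]) ** 2
              for i in range(len(estimates)))
    df = len(estimates) - n_clusters
    return fitted, q_w, df

fitted, q_w, df = fit_clusters(estimates, std_errs, cluster)
print([round(b, 3) for b in fitted], round(q_w, 2), df)
```

The fitted cluster values can be compared with the parameter estimates in Table 6.3.8a; the degrees of freedom, 18 functions less 4 parameters, match the 14 reported in the text.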
6.3.2 Supplemental Margins as a Means of Extending the
Complete Data Sample Size
Table 6.3.9 contains the estimated probabilities of
reporting colds in 1973, 1974, and 1975 by data pattern
sources for the (area x sex) subpopulations which had
inadequate goodness-of-fit statistics for the model 6.3.3 in
Section 6.3.1. An alternative strategy for incorporating
the incomplete data into the supplemental margins analysis
is to consider the analysis as an extension to a complete
data analysis, and to include those estimates from the
double and single patterns when they are similar to the
complete (triple) estimates. The estimates and standard
errors from the incomplete data groups were compared to
Table 6.3.8a

Specification Matrix, Parameter Estimates and Predicted
Values for Cluster Model for the Proportion of
Males Reporting Colds in 1973, 1974 and 1975

Specification Matrix XP

    1 0 0 0
    1 0 0 0
    0 1 0 0
    0 0 1 0
    1 0 0 0
    0 1 0 0
    0 0 1 0
    0 0 1 0
    0 0 0 1
    0 1 0 0
    1 0 0 0
    1 0 0 0
    0 0 1 0
    0 1 0 0
    0 1 0 0
    0 1 0 0
    0 1 0 0
    0 1 0 0

Estimates and Standard Errors

    Parameter 1    .523 ± .007
    Parameter 2    .471 ± .006
    Parameter 3    .429 ± .008
    Parameter 4    .374 ± .021

Parameter Interpretation

    Parameter 1: Predicted value for Charlotte in 1973 and 1974,
                 Birmingham in 1974 and Utah in 1974 and 1975
    Parameter 2: Predicted value for Charlotte in 1975, Birmingham
                 in 1975, Utah in 1973, California I in 1974 and
                 1975, and California II all three years
    Parameter 3: Predicted value for Birmingham in 1973,
                 NYC in 1973 and 1974, and California I in 1973
    Parameter 4: Predicted value for NYC in 1975

QW = 8.77     p-value = .8455     d.f. = 14
Table 6.3.8b

Specification Matrix, Parameter Estimates and Predicted
Values for Cluster Model for the Proportion of
Females Reporting Colds in 1973, 1974 and 1975

Specification Matrix XP

    1 0 0 0
    1 0 0 0
    1 0 0 0
    0 1 0 0
    1 0 0 0
    0 1 0 0
    0 0 1 0
    0 0 0 1
    0 0 1 0
    0 1 0 0
    0 1 0 0
    1 0 0 0
    0 0 1 0
    0 0 1 0
    0 0 1 0
    0 1 0 0
    0 1 0 0
    0 0 1 0

Estimates and Standard Errors

    Parameter 1    .646 ± .007
    Parameter 2    .590 ± .006
    Parameter 3    .531 ± .008
    Parameter 4    .459 ± .021

Parameter Interpretation

    Parameter 1: Predicted value for Charlotte, Birmingham in
                 1974, and Utah in 1975
    Parameter 2: Predicted value for Birmingham in 1973 and 1975,
                 Utah in 1973 and 1974, and California II in 1973
                 and 1974
    Parameter 3: Predicted value for NYC in 1973 and 1975,
                 California I all three years, and California II
                 in 1975
    Parameter 4: Predicted value for NYC in 1974

QW = 13.30     p-value = .503     d.f. = 14
Table 6.3.9

Charlotte Males

    Data pattern     P73    P74    P75
    73 74 75        .613   .507   .442
    73 74           .453   .459
    73 75           .500          .370
    74 75                  .529   .471
    73              .570
    74                     .541
    75                            .479

    Statistic      Value   p-value   df
    QW,ALL         15.12    .0878     9
    QW,E            8.43    .3925     8
    QW,CASE         6.34    .5008     7

Charlotte Females

    Data pattern     P73    P74    P75
    73 74 75        .706   .698   .653
    73 74           .629   .685
    73 75           .674          .457
    74 75                  .561   .621
    73              .630
    74                     .667
    75                            .653

    Statistic      Value   p-value   df
    QW,ALL         15.18    .0861     9
    QW,E            8.48    .3876     8
    QW,CASE         8.45    .2949     7

Birmingham Males

    Data pattern     P73    P74    P75
    73 74 75        .373   .490   .461
    73 74           .485   .592
    73 75           .482          .571
    74 75                  .530   .470
    73              .478
    74                     .488
    75                            .458

    Statistic      Value   p-value   df
    QW,ALL         24.27    .0039     9
    QW,E            5.04    .4116     5
    QW,CASE         2.69    .6107     4

Utah Females

    Data pattern     P73    P74    P75
    73 74 75        .563   .622   .689
    73 74           .688   .656
    73 75           .545          .576
    74 75                  .526   .684
    73              .630
    74                     .612
    75                            .566

    Statistic      Value   p-value   df
    QW,ALL         21.46    .0108     9
    QW,E            7.89    .3419     7
    QW,CASE         7.40    .2851     6

California II Males

    Data pattern     P73    P74    P75
    73 74 75        .489   .483   .545
    73 74           .496   .526
    73 75           .297          .459
    74 75                  .542   .500
    73              .478
    74                     .425
    75                            .437

    Statistic      Value   p-value   df
    QW,ALL         17.41    .0418     9
    QW,E            5.89    .5526     7
    QW,CASE         5.12    .5285     6
those for the completes (73,74,75), and those that did not
appear similar (arbitrarily taken to be more than two
standard errors away) were dropped from the analysis. π73,
π74, and π75 were then re-estimated with the model (6.3.3)
modified by deletion of the rows of X corresponding to the
eliminated functions. The goodness-of-fit statistic was
then evaluated to determine whether the deletions were
appropriate. These statistics are written QW,E (for
element-wise deletion) and also appear in Table 6.3.9.
One may consider it prudent to delete from
consideration those estimates from an entire data pattern
group instead of just individual members, with the rationale
being that whatever process led to a questionable estimate
for one attribute may have been likely to affect others as
well. This may be considered a more conservative approach.
Thus, the estimation process was repeated a third time,
dropping those functions from an entire data pattern group
when any of its members was eliminated. The goodness-of-fit
statistics for these models are written QW,CASE and also
appear in Table 6.3.9.
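
The two deletion rules can be sketched as follows. All numbers in the example are hypothetical. The text does not say which standard error defines the two standard error band, so the sketch assumes that the distance is measured in units of the incomplete-group estimate's own standard error. The function names are introduced here for illustration.

```python
# Hypothetical complete-data (73,74,75) estimates for one subpopulation.
complete = {1973: 0.50, 1974: 0.52, 1975: 0.48}

# Hypothetical incomplete-data estimates: pattern -> {year: (estimate, std err)}.
incomplete = {
    (1973, 1974): {1973: (0.51, 0.03), 1974: (0.60, 0.03)},
    (1973, 1975): {1973: (0.49, 0.03), 1975: (0.47, 0.03)},
    (1974, 1975): {1974: (0.53, 0.03), 1975: (0.50, 0.03)},
    (1973,): {1973: (0.58, 0.03)},
    (1974,): {1974: (0.51, 0.03)},
    (1975,): {1975: (0.46, 0.03)},
}

def element_wise(complete, incomplete, k=2.0):
    """Flag each incomplete estimate lying more than k standard errors
    (its own standard error, by assumption) from the complete estimate."""
    dropped = set()
    for pattern, cells in incomplete.items():
        for year, (est, se) in cells.items():
            if abs(est - complete[year]) > k * se:
                dropped.add((pattern, year))
    return dropped

def case_wise(incomplete, dropped):
    """Drop every estimate from any pattern group with a flagged member."""
    bad_patterns = {pattern for pattern, _ in dropped}
    return {(pattern, year) for pattern in bad_patterns
            for year in incomplete[pattern]}

dropped_elem = element_wise(complete, incomplete)
dropped_case = case_wise(incomplete, dropped_elem)

# Twelve functions (3 complete + 9 incomplete) estimate three parameters, so
# the full model has 9 d.f.; each deleted function removes one d.f.
n_functions = len(complete) + sum(len(c) for c in incomplete.values())
df_elem = n_functions - len(dropped_elem) - len(complete)
df_case = n_functions - len(dropped_case) - len(complete)
print(sorted(dropped_elem), df_elem, df_case)
```

The degrees of freedom computed at the end follow the same accounting as the QW,ALL, QW,E and QW,CASE rows of Table 6.3.9.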
Seven of the ten element deletions (circled) are from
the double years, two of the single year deletions are for
1975 and the other is for 1973. Of the seven doubles
eliminations, three are from the 73,75 data pattern group,
which is the doubles group about which one might have the
most doubt. Note that there are a few cases where the
assumption that the 'complete' estimates are 'true' is
questionable, especially for Birmingham males for 1973. In
that situation, the incomplete data group estimates are
clustered at .485, .482 and .478, while the 73,74,75
estimate is .373. Table 6.3.10 contains the results of the
estimation of π73, π74, and π75 using a modified version of
(6.3.3) for both the element-wise deletions and case-wise
deletions. The starred rows correspond to the
subpopulations where estimates changed. Probably the most
noticeable difference is the estimate for π73 for Birmingham
males, which drops from .439 for the forced similarity
structure model to .384 when element-wise deletion was
performed. The next largest change is for π75 for Utah
females, which increases from .648 in Table 6.3.1 to .688
for the element-wise deletions. The differences from the
estimates for element-wise deletion to those for case-wise
deletions are very minor, as might be expected since the
element-wise deletion process should have smoothed out most
of the inconsistent estimates.
The identity model expressed in (6.3.3) was fitted to
these element-wise estimates to gain a preliminary
assessment of the sources of variation. Table 6.3.11
contains the linear hypotheses concerning the resulting
parameter estimates, which are the predicted probability
functions themselves. The Wald statistics to evaluate these
hypotheses are also included. The results are very nearly
identical to those obtained for the same tests for the
forced similarity structure estimates, with a significant
Table 6.3.10

Estimates of π73, π74 and π75 by Area and Sex Adjusting
for Element-Wise Deletions and Case-Wise Deletions

                         Element-Wise           Case-Wise
Area            Sex    π73    π74    π75     π73    π74    π75

Charlotte        M    .581   .509   .460    .581   .525   .461
Charlotte        F    .661   .675   .650    .660   .675   .650
Birmingham       M    .384   .502   .462    .375   .501   .462
Birmingham       F    .599   .635   .574    .599   .635   .574
NYC              M    .415   .404   .371    .415   .404   .371
NYC              F    .489   .457   .521    .489   .457   .521
Utah             M    .461   .528   .528    .461   .528   .528
Utah             F    .591   .616   .688    .591   .613   .687
California I     M    .427   .471   .471    .427   .471   .471
California I     F    .519   .531   .548    .519   .531   .548
California II    M    .485   .484   .528    .485   .484   .535
California II    F    .582   .590   .548    .582   .590   .548
Table 6.3.11

Results of Linear Hypotheses Concerning Estimates of the
Probability of Reporting Colds in 1973, 1974 and 1975

Hypothesis                                          Qc      p-value   df

1. No difference between sexes for
   averages over area*time                        270.77    .00000     1
2. No variation among areas for
   averages over sex*time                         160.33    .00000     5
3. No variation among times for
   averages over sex*area                           5.84    .05370     2
4. Homogeneity across time for difference
   between sexes for average across areas           1.73    .42086     2
5. Homogeneity across area of differences
   between sexes for averages across time          27.09    .00005     5
6. Homogeneity across area for difference
   between time for averages across sex           141.05    .00000    10
7. No sex*time*area interaction                    21.86    .01580    10
three-way interaction surfacing, as well as a significant
(area x time) interaction and a significant (area x sex)
interaction.
Instead of pursuing linear model descriptions of the
estimates obtained using element-wise or case-wise deletion,
it was decided to investigate hypotheses concerning the
variation of time within area and area within time for each
of the six sets of estimates: the forced estimates for both
males and females, the element-wise deleted estimates for
males and females, and the case-wise deleted estimates for
males and females. If the different estimation procedures
led to vast differences in the variation among the
estimates, such differences would appear in the results of
such analysis.
Accordingly, identity cell mean models of the form
(6.3.3) were fit to each of these six groups of estimates of
π73, π74, and π75 for the (area x sex) subpopulations.
Hypothesis tests of the form Cβ = 0 were then performed to
investigate the sources of variation of interest. These
tests are evaluated with Wald test statistics. Table 6.3.12
contains the C matrices used in conjunction with the various
tests, and Table 6.3.13 displays the resulting test
statistics for each of the six groups of estimates.
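
A Wald statistic for a two-row contrast matrix can be sketched as below. The example tests for time differences in Charlotte among males using the supplemental margins estimates and standard errors from Table 6.3.14, with C restricted to the three Charlotte functions. Treating the three yearly estimates as uncorrelated is an assumption made for illustration; the analysis in the text uses the full covariance matrix of the repeated measures, so the value obtained here is not the one reported in Table 6.3.13. The function name wald_two_row is hypothetical.

```python
import math

def wald_two_row(c, b, v):
    """Q_C = (Cb)' (C V C')^{-1} (Cb) for a contrast matrix C with two rows."""
    n = len(b)
    cb = [sum(row[k] * b[k] for k in range(n)) for row in c]
    cv = [[sum(row[k] * v[k][j] for k in range(n)) for j in range(n)] for row in c]
    cvc = [[sum(cv[i][k] * c[j][k] for k in range(n)) for j in range(2)]
           for i in range(2)]
    det = cvc[0][0] * cvc[1][1] - cvc[0][1] * cvc[1][0]
    # Quadratic form using the closed-form inverse of a symmetric 2 x 2 matrix.
    return (cb[0] ** 2 * cvc[1][1]
            - 2.0 * cb[0] * cb[1] * cvc[0][1]
            + cb[1] ** 2 * cvc[0][0]) / det

# Charlotte males, 1973 to 1975 (Table 6.3.14, supplemental margins).
b = [0.552, 0.514, 0.460]
se = [0.019, 0.019, 0.017]
# Diagonal covariance matrix: an assumption, not the covariance used in the text.
v = [[se[i] ** 2 if i == j else 0.0 for j in range(3)] for i in range(3)]

# Contrasts 1973 vs 1975 and 1974 vs 1975, as in hypothesis 1 of Table 6.3.12.
c = [[1, 0, -1],
     [0, 1, -1]]

q_c = wald_two_row(c, b, v)
p_value = math.exp(-q_c / 2.0)  # chi-square upper tail for 2 d.f.
print(round(q_c, 2), round(p_value, 4))
```

With two degrees of freedom the chi-square tail probability has the closed form exp(-Q/2), which is why no distribution library is needed here.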
On the whole, there does not appear to be much
difference in the results of these tests for the element-wise
deletion estimates or case-wise deletion estimates
compared to each other or compared to the forced similarity
Table 6.3.12

Contrast Matrices for Tests of Time Differences Within
Areas and Area Differences Within Time for Estimates
of 1+ Colds via Supplemental Margins

1. No time differences in Charlotte

     1  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
     0  1 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

2. No time differences in Birmingham

     0  0  0  1  0 -1  0  0  0  0  0  0  0  0  0  0  0  0
     0  0  0  0  1 -1  0  0  0  0  0  0  0  0  0  0  0  0

3. No time differences in NYC

     0  0  0  0  0  0  1  0 -1  0  0  0  0  0  0  0  0  0
     0  0  0  0  0  0  0  1 -1  0  0  0  0  0  0  0  0  0

4. No time differences in Utah

     0  0  0  0  0  0  0  0  0  1  0 -1  0  0  0  0  0  0
     0  0  0  0  0  0  0  0  0  0  1 -1  0  0  0  0  0  0

5. No time differences in California I

     0  0  0  0  0  0  0  0  0  0  0  0  1  0 -1  0  0  0
     0  0  0  0  0  0  0  0  0  0  0  0  0  1 -1  0  0  0

6. No time differences in California II

     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0 -1
     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1 -1

7. No area differences in 1973

     1  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0
     0  0  0  1  0  0  0  0  0  0  0  0  0  0  0 -1  0  0
     0  0  0  0  0  0  1  0  0  0  0  0  0  0  0 -1  0  0
     0  0  0  0  0  0  0  0  0  1  0  0  0  0  0 -1  0  0
     0  0  0  0  0  0  0  0  0  0  0  0  1  0  0 -1  0  0

8. No area differences in 1974

     0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0
     0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0 -1  0
     0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0 -1  0
     0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0 -1  0
     0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0 -1  0

9. No area differences in 1975

     0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1
     0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0 -1
     0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0 -1
     0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0 -1
     0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0 -1
Table 6.3.13

Test Statistics and Corresponding P-Values for
the Linear Hypotheses Investigating Time Variation
Within Areas and Area Variation Within Time

                           Forced   Forced    Elem     Elem     Case     Case
Hypothesis             df  Males    Females   Males    Females  Males    Females

No time difference      2  13.79     2.24     19.40     1.22    19.95     1.22
in Charlotte               .0010    .3260     .0000    .5430    .0000    .5420

No time difference      2  22.32    17.24     31.39    17.24    33.70    17.24
in Birmingham              .0000    .0001     .0000    .0002    .0000    .0001

No time difference      2   2.07     4.96      2.07     4.96     2.07     4.96
in NYC                     .3550    .0840     .3550    .0840    .3550    .0840

No time difference      2  10.99     5.52     10.99    14.60    10.99    14.80
in Utah                    .0040    .0633     .0041    .0007    .0041    .0006

No time difference      2   5.83     1.79      5.83     1.79     5.83     1.79
in California I            .0540    .4090     .0540    .4090    .0540    .4090

No time difference      2    .66     2.91      2.74     2.91     3.31     2.92
in California II           .7190    .2320     .2540    .2320    .1910    .2326

No area difference      5  33.59    47.51     53.44    47.20    55.97    45.48
in 1973                    .0000    .0000     .0000    .0000    .0000    .0000

No area difference      5  29.14    88.95     25.42    90.00    26.77    89.49
in 1974                    .0000    .0000     .0001    .0000    .0000    .0000

No area difference      5  35.10    43.18     39.83    57.40    40.59    57.11
in 1975                    .0000    .0000     .0000    .0000    .0000    .0000
estimates. As far as area variation goes, there are very
significant area differences for each year under
consideration for each set of estimates. When one examines
the results of the tests concerning time differences within
areas for the males, the differences are in degree of
magnitude and not statistical interpretation. There are
significant differences in time for Charlotte for all three
sets of estimates, with the test statistics for the element-wise
deletions and case-wise deletions being larger than
that for the forced similarity estimates. The same
situation occurs for the time differences in Birmingham.
Meanwhile, the time differences for males in California II
are non-significant, although the test statistic has a
higher value for the element-wise and case-wise deletion
estimates. The other three areas for males have identical
results, since none of the relevant estimates were deleted.
The females do exhibit a difference between the forced
similarity estimates and the other estimates. The test for
time differences in Utah only approaches significance with a
p-value of .0633 for the forced similarity estimates, but
those for the element-wise deleted estimates and the case-wise
deleted estimates are strongly significant with p <
.001. However, since the element-wise and case-wise deleted
estimates are a consequence of the debatable assumption that
the 'Complete' data are 'correct', one cannot judge which of
these results is correct. So, only one disparity can be
noted, and it is marginal. In general, the prevailing theme
is similar findings across methods. Certainly, using the
forced similarity estimates is a much easier process, and
possibly preferable for that reason.
However, on the whole, the use of supplemental margins
is a time-consuming and computing intensive process. Since
an alternative would be the use of multivariate ratio
estimation, the estimates of π73, π74, and π75 were
calculated so that the two adjustment procedures could be
compared for this analysis. Future chapter sections will
describe the application of multivariate ratio estimation.
Table 6.3.14 contains the estimates of π73, π74, and
π75 via supplemental margins and multivariate ratio
estimation, along with the corresponding standard errors, by
the (area x sex) subpopulations under study. The estimates
appear to be very similar. As would be expected, the fact
that supplemental margins uses a weighted estimation
procedure leads it to provide lower standard errors,
although the degree of difference is very small. To
summarize, the differences between the two procedures in
terms of producing these estimates appear to be minimal.
Since multivariate ratio estimation will be seen to be much
easier in terms of application, it might be the tool of
choice in a similar situation. However, the benefits of
supplemental margins may be worth the effort when the data
set is not very large. With the CHESS data, however, it
would appear that the use of multivariate ratio estimation
to adjust for missing data is satisfactory.
Table 6.3.14

Estimates of π73, π74 and π75 by Area and Sex by Supplemental
Margins and Multivariate Ratio Estimation
(standard errors in parentheses)

                        Supplemental Margins      Ratio Estimation
Area            Sex     π73     π74     π75      π73     π74     π75

Charlotte        M     .552    .514    .460     .549    .509    .464
                      (.019)  (.019)  (.017)   (.020)  (.020)  (.017)
Charlotte        F     .661    .674    .640     .661    .672    .640
                      (.018)  (.018)  (.016)   (.018)  (.018)  (.016)
Birmingham       M     .439    .512    .469     .444    .508    .466
                      (.012)  (.011)  (.011)   (.013)  (.013)  (.011)
Birmingham       F     .599    .635    .574     .601    .632    .575
                      (.013)  (.011)  (.011)   (.013)  (.011)  (.011)
NYC              M     .415    .404    .371     .428    .410    .373
                      (.028)  (.021)  (.021)   (.029)  (.021)  (.021)
NYC              F     .489    .457    .521     .500    .467    .520
                      (.028)  (.021)  (.023)   (.029)  (.022)  (.023)
Utah             M     .461    .529    .529     .464    .531    .531
                      (.017)  (.018)  (.018)   (.018)  (.018)  (.018)
Utah             F     .595    .611    .648     .600    .615    .638
                      (.017)  (.017)  (.017)   (.017)  (.018)  (.017)
California I     M     .427    .471    .471     .425    .471    .471
                      (.016)  (.016)  (.015)   (.016)  (.016)  (.015)
California I     F     .519    .531    .548     .521    .535    .546
                      (.016)  (.017)  (.016)   (.016)  (.017)  (.016)
California II    M     .473    .478    .493     .478    .488    .490
                      (.018)  (.019)  (.018)   (.018)  (.019)  (.019)
California II    F     .582    .590    .548     .579    .592    .548
                      (.019)  (.020)  (.020)   (.020)  (.021)  (.020)
6.4 Multivariate Ratio Estimation for the Analysis of Mean
Colds for the Six Point Data
The final analysis for the complete data in the
preceding chapter was for mean colds in the years 1973,
1974, and 1975. Multivariate ratio estimation is a useful
technique for handling missing data in the analysis of
means, as well as other types of estimators. The sources of
data included in this analysis are the complete data
subgroups and the doubles subgroups (since the data must
thus be present for six time points, it is referred to as
the six points data). Since three measures per year will be
estimated, this means that there will be one missing value
per observation for the doubles data, or (2131 + 1208 + 434
= 3773) missing altogether, where the numbers in parentheses
represent the number of observations for
(1973,1974),(1973,1975), and (1974,1975) respectively.
Thus, since there are 4002 complete observations, 3773/(3 *
7775), or 16 percent of the data will be missing. If the
single years were also included, a similar calculation would
show that nearly 50 percent of the observations would be
missing. Since it has been suggested that the amount of
missing data for this kind of adjustment scheme be limited
to around 10 percent (Stanish, Gillings, and Koch 1978), it
was decided not to include the single years in this
analysis. Also, unlike the supplemental margins analysis,
multivariate ratio estimation will not result in an
intermediate goodness-of-fit test with which one can assess
the quality of the estimates. By including the doubles, the
sample size for an analysis of mean colds is extended from
4002 to 7775, almost a 95 percent increase.
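
The counts behind these percentages can be verified with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the variable names are introduced here for illustration.

```python
# Subjects with data for all three years, and subjects observed in two years.
complete = 4002
doubles = {(1973, 1974): 2131, (1973, 1975): 1208, (1974, 1975): 434}
years = 3

n_doubles = sum(doubles.values())   # each doubles subject is missing one year
total_subjects = complete + n_doubles

# Share of the (subject x year) values that are missing.
missing_fraction = n_doubles / (years * total_subjects)

# Relative gain in sample size from adding the doubles to the completes.
increase = n_doubles / complete

print(n_doubles, total_subjects, round(missing_fraction, 2), round(increase, 3))
```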
The shape of the analysis is partly constrained by the
abilities of current computer software. The program MISCAT
is a Fortran program which will generate the necessary ratio
estimates and perform asymptotic regression. However,
there is a limit to the number of functions which it can
handle. This is currently eighty, and must include
indicator functions for the presence or absence of data
values. Thus, the number of possible elements per function
vector is actually forty. This means that modeling three
response variables across the twenty-four subpopulations
generated by an (area x sex x race) cross-classification
would not be possible, since 72 functions would result.
However, one can choose to separate the analysis into
smaller pieces, and if desired, splice the resulting
parameter estimates and covariances back together for a
second modeling stage, much like the predicted estimates of
π73, π74, and π75 were modeled in a second stage in section
6.3. Since sex has shown itself to be a major source of
variation in previous work, it was decided to split the
analysis into two pieces--one for males, and one for
females.
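
The capacity constraint described above reduces to a simple count. The sketch below encodes the limit of eighty functions stated in the text, with one indicator function accompanying each response function; the constant and function names are hypothetical and do not correspond to any MISCAT option.

```python
MAX_FUNCTIONS = 80  # limit stated in the text, indicator functions included

def fits_in_one_run(n_subpops, n_responses, max_functions=MAX_FUNCTIONS):
    """Each response function needs a companion indicator function for the
    presence or absence of data, so the function count is doubled."""
    needed = 2 * n_subpops * n_responses
    return needed <= max_functions

# Three yearly responses for the 24 (area x sex x race) subpopulations.
print(fits_in_one_run(24, 3))
# The same responses for the 12 (area x race) subpopulations of one sex.
print(fits_in_one_run(12, 3))
```

Splitting by sex brings each run to 36 response functions, which is within the forty available per function vector.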
Table 6.4.1 contains the estimates of mean colds for
1973 for males and females by area and race, and Tables
6.4.2 and 6.4.3 contain the estimates for 1974 and 1975.
Table 6.4.1

Mean Colds by Area, Sex and Race for 1973

MALES
Area            Race     Mean       N    Missing

Charlotte       White     .80     274       43
Charlotte       Other     .68     130       25
Birmingham      White     .64     456      364
Birmingham      Other     .39     245      211
NYC             White     .56     119       83
NYC             Other     .46      26       41
Utah            White     .67     431       95
Utah            Other     .32      38       11
California I    White     .58     571      136
California I    Other     .67      85       23
California II   White     .73     421       87
California II   Other     .52      40        8
TOTAL                     .63    2836     1127

FEMALES
Charlotte       White    1.04     314       31
Charlotte       Other    1.15     123       35
Birmingham      White     .93     500      358
Birmingham      Other     .86     249      200
NYC             White     .73     105       62
NYC             Other     .61      31       42
Utah            White     .96     432       52
Utah            Other     .73      51        5
California I    White     .80     504      113
California I    Other     .67      85       29
California II   White     .92     363       48
California II   Other     .56      41        3
TOTAL                     .90    2798      978

TOTAL
Charlotte       White     .93     588       74
Charlotte       Other     .91     253       60
Birmingham      White     .79     956      722
Birmingham      Other     .63     494      411
NYC             White     .64     224      145
NYC             Other     .54      57       83
Utah            White     .81     863      147
Utah            Other     .55      89       16
California I    White     .68    1075      249
California I    Other     .67     170       52
California II   White     .82     784      135
California II   Other     .54      81       11
TOTAL                     .76    5634     2105
Table 6.4.2

Mean Colds by Area, Sex and Race for 1974

MALES
Area            Race     Mean       N    Missing

Charlotte       White     .77     285       32
Charlotte       Other     .76     141       14
Birmingham      White     .82     770       50
Birmingham      Other     .63     450        6
NYC             White     .55     192       10
NYC             Other     .48      64        3
Utah            White     .80     488       38
Utah            Other     .53      45        4
California I    White     .67     667       40
California I    Other     .60     101        7
California II   White     .77     472       36
California II   Other     .36      47        1
TOTAL                     .72    3722      241

FEMALES
Charlotte       White    1.14     313       32
Charlotte       Other    1.08     144       14
Birmingham      White    1.02     828       30
Birmingham      Other    1.01     439       10
NYC             White     .70     162        5
NYC             Other     .58      71        2
Utah            White     .97     455       29
Utah            Other     .67      52        4
California I    White     .83     583       34
California I    Other     .59     106        8
California II   White     .96     390       21
California II   Other     .66      41        3
TOTAL                     .94    3584      192

TOTAL
Charlotte       White     .96     598       64
Charlotte       Other     .92     285       28
Birmingham      White     .92    1598       80
Birmingham      Other     .82     889       16
NYC             White     .62     354       15
NYC             Other     .53     135        5
Utah            White     .88     943       67
Utah            Other     .61      97        8
California I    White     .74    1250       74
California I    Other     .60     207       15
California II   White     .86     862       57
California II   Other     .50      88        4
TOTAL                     .83    7306      433
Table 6.4.3

Mean Colds by Area, Sex and Race for 1975

MALES
Area            Race     Mean       N    Missing

Charlotte       White     .64     211      106
Charlotte       Other     .45     102       53
Birmingham      White     .75     717      103
Birmingham      Other     .53     430       26
NYC             White     .56     156       46
NYC             Other     .46      59        8
Utah            White     .85     453       73
Utah            Other     .70      44        5
California I    White     .70     627       80
California I    Other     .67     101        7
California II   White     .79     383      125
California II   Other     .46      37       11
TOTAL                     .70    3320      643

FEMALES
Charlotte       White     .98     244      101
Charlotte       Other     .85     116       42
Birmingham      White     .96     765       93
Birmingham      Other     .80     422       27
NYC             White     .76     129       38
NYC             Other     .68      60       13
Utah            White    1.18     431       53
Utah            Other     .82      45       11
California I    White     .86     540       77
California I    Other     .67     101       13
California II   White    1.03     329       82
California II   Other     .68      38        6
TOTAL                     .93    3220      556

TOTAL
Charlotte       White     .82     455      207
Charlotte       Other     .67     218       95
Birmingham      White     .86    1482      196
Birmingham      Other     .67     852       53
NYC             White     .65     285       84
NYC             Other     .57     119       21
Utah            White    1.01     884      126
Utah            Other     .76      89       16
California I    White     .78    1167      157
California I    Other     .67     202       20
California II   White     .90     712      207
California II   Other     .57      75       17
TOTAL                     .81    6540     1199
The only discernible trend would appear to be that females
consistently report more colds than males, a point which
supports the decision to model the sexes separately. Also
included in the tables are sample sizes and the number of
missing values by area and sex. Sample sizes are
sufficient for the subsequent functional asymptotic
regression analysis, as the smallest number of non-missing
observations on which a mean estimate is based is 26 for
'other' males in NYC for 1973. Most of the other
non-missing sample sizes are in the hundreds, one of the
benefits of such a large dataset. The use of asymptotic
results is thus justified.
Let $y_{il}' = (y_{i1l}, y_{i2l}, y_{i3l})$ represent the vector of
responses for the l-th subject in the i-th subpopulation,
where i = 1, 2, ..., 12 (i = 1 indicates Charlotte whites, i = 2
indicates Charlotte 'others', and so on up to i = 12 for
California II 'others'). This discussion is limited to
males. Let $y_{ikl}$ denote the number of colds for the k-th
year for the l-th person in the i-th subpopulation. Let
$\mu_i' = (\mu_{i1}, \mu_{i2}, \mu_{i3})$ represent the expected value of
$y_{il}'$. An appropriate estimator for $\mu_i$ which takes into account the
missing value structure of the data is the ratio estimator
(3.5.1) of Chapter III. If $u_{il}' = (u_{i1l}, u_{i2l}, u_{i3l})$ denotes
a random vector of indicator variables for whether or not
the k-th response value is present, the ratio estimator for
$\mu_{ik}$ can be expressed as

$$\bar{y}_{ik} = \frac{\sum_l f_{ikl}}{\sum_l u_{ikl}} \qquad (6.4.1)$$

where $f_{ikl} = y_{ikl} u_{ikl}$. If
$g_{il}' = (f_{i1l}, f_{i2l}, f_{i3l}, u_{i1l}, u_{i2l}, u_{i3l})$ and $\bar{g}_i$ denotes
the sample mean vector of the $g_{il}$'s, the i-th ratio estimator can be
written

$$\bar{y}_i = \exp\{A \log(\bar{g}_i)\} \qquad (6.4.2)$$

with $A = [I_3, -I_3]$. The estimated covariance matrix of $\bar{y}_i$
can be expressed as

$$V_{\bar{y}_i} = D_{\bar{y}_i} A D_{\bar{g}_i}^{-1} V_{\bar{g}_i} D_{\bar{g}_i}^{-1} A' D_{\bar{y}_i} \qquad (6.4.3)$$

where the covariance matrix $V_{\bar{g}_i}$ for $\bar{g}_i$ is calculated according
to expression (3.5.3) of Chapter III. The estimated $\bar{y}_i$ and
their corresponding standard errors are displayed in Table
6.4.4. The estimates for females are also included.

If $\bar{y}$ is written $\bar{y} = (\bar{y}_1', \bar{y}_2', \ldots, \bar{y}_{12}')'$ and has
expected value $\mu = (\mu_1', \mu_2', \ldots, \mu_{12}')'$, then the estimated
covariance of $\bar{y}$, $V_{\bar{y}}$, can be computed from (6.4.3) above,
replacing $\bar{y}_i$ by $\bar{y}$, $\bar{g}_i$ by
$\bar{g} = (\bar{g}_1', \bar{g}_2', \ldots, \bar{g}_{12}')'$, and $V_{\bar{g}_i}$ by the matrix
$V_{\bar{g}}$, where $V_{\bar{g}}$ is a block diagonal matrix with $V_{\bar{g}_i}$ as the
i-th block.
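
The ratio estimator and its linearized covariance matrix can be sketched for a single subpopulation as follows. The six response vectors are hypothetical. The covariance matrix of the mean vector is computed with divisor n squared, which is an assumption here because expression (3.5.3) of Chapter III is not reproduced in this section. All function names are introduced for illustration.

```python
import math

# Hypothetical cold counts for six subjects (1973, 1974, 1975).
# None marks a year with no response.
subjects = [
    (1, 0, 2),
    (0, 1, None),
    (2, None, 1),
    (None, 0, 0),
    (1, 1, 1),
    (0, 2, None),
]

def g_vector(y):
    """g = (f1, f2, f3, u1, u2, u3): u_k flags presence and f_k = y_k * u_k."""
    u = [0 if v is None else 1 for v in y]
    f = [0 if v is None else v for v in y]
    return f + u

def mean_and_cov(rows):
    """Sample mean vector and estimated covariance matrix of that mean
    (divisor n * n, an assumption made for this sketch)."""
    n, p = len(rows), len(rows[0])
    m = [sum(r[j] for r in rows) / n for j in range(p)]
    v = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n * n)
          for b in range(p)] for a in range(p)]
    return m, v

def ratio_estimates(g_bar):
    """exp(A log g_bar) with A = [I3, -I3], which equals f_bar_k / u_bar_k."""
    return [math.exp(math.log(g_bar[k]) - math.log(g_bar[k + 3]))
            for k in range(3)]

def ratio_covariance(g_bar, v_g):
    """H V_g H' with H = D_y A D_g^{-1}, the form of expression (6.4.3)."""
    y = ratio_estimates(g_bar)
    h = [[0.0] * 6 for _ in range(3)]
    for k in range(3):
        h[k][k] = y[k] / g_bar[k]
        h[k][k + 3] = -y[k] / g_bar[k + 3]
    hv = [[sum(h[i][a] * v_g[a][j] for a in range(6)) for j in range(6)]
          for i in range(3)]
    return [[sum(hv[i][a] * h[j][a] for a in range(6)) for j in range(3)]
            for i in range(3)]

g_rows = [g_vector(y) for y in subjects]
g_bar, v_g = mean_and_cov(g_rows)
y_bar = ratio_estimates(g_bar)
v_y = ratio_covariance(g_bar, v_g)
std_errs = [math.sqrt(v_y[k][k]) for k in range(3)]
print([round(v, 3) for v in y_bar], [round(s, 3) for s in std_errs])
```

Each ratio estimate equals the mean of the observed values for that year, and the covariance matrix retains the correlation between years that arises from subjects who contribute to more than one estimate.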
Variation among the elements of y can be analyzed with
functional asymptotic regression methodology. Let a model
Table 6.4.4

Estimates and Standard Errors for Mean Colds in
1973, 1974 and 1975 by Sex, Area and Race

                                     Males                  Females
Area            Race          1973   1974   1975     1973    1974    1975

Charlotte       White   Y1    .803   .768   .635    1.045   1.144    .980
Charlotte       Other   Y2    .685   .759   .451    1.155   1.083    .853
Birmingham      White   Y3    .643   .816   .755     .930   1.016    .957
Birmingham      Other   Y4    .388   .627   .535     .863   1.011    .801
NYC             White   Y5    .563   .552   .564     .733    .704    .760
NYC             Other   Y6    .462   .484   .458     .613    .578    .683
Utah            White   Y7    .668   .801   .848     .958    .970   1.183
Utah            Other   Y8    .316   .533   .705     .725    .673    .822
California I    White   Y9    .581   .667   .700     .802    .834    .863
California I    Other   Y10   .671   .604   .673     .671    .594    .673
California II   White   Y11   .734   .773   .794     .920    .964   1.033
California II   Other   Y12   .525   .362   .459     .561    .659    .684
for $\bar{y}$ be stated as

$$E\{\bar{y}\} = X\beta \qquad (6.4.4)$$

where X denotes a (u x t) specification matrix of interest
and β represents a (t x 1) vector of unknown parameters.
The sample sizes involved with these subpopulations are
sufficiently large such that $\bar{y}$ is approximately multivariate
normal. The preliminary model
was again used as a starting point to assess sources of
variation among the estimates. Table 6.4.5 contains the
results of linear hypothesis tests targeted at investigating
potential sources of variation. The Wald statistic QC is
presented for each hypothesis, along with the d.f. and p-
value.
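For reference, the weighted least squares quantities usually associated with a model of the form (6.4.4) are collected below. This is the standard formulation, assumed here to agree with the one developed in the earlier chapters rather than quoted from them; $b$ and $V_b$ denote the estimate of $\beta$ and its estimated covariance matrix, and $C$ is a contrast matrix.

```latex
\begin{align*}
b   &= (X' V_y^{-1} X)^{-1} X' V_y^{-1} y, \\
V_b &= (X' V_y^{-1} X)^{-1}, \\
Q_W &= (y - Xb)' V_y^{-1} (y - Xb), \qquad \text{d.f.} = u - t, \\
Q_C &= b' C' (C V_b C')^{-1} C b, \qquad \text{d.f.} = \operatorname{rank}(C).
\end{align*}
```

Under the model, $Q_W$ and $Q_C$ are referred to chi-square distributions with the indicated degrees of freedom; for the identity model $u = t$, so no goodness-of-fit test is available at that stage.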
There is a very significant race effect, as well as a
significant area effect ($\alpha = .05$ level of significance). Time
is borderline. The three-way interaction is not
significant, and neither is the two-way interaction of race
and time. There is, however, a very significant interaction
of time and area ($Q_C$ = 50.86, d.f.=10), as well as a
significant (race x area) interaction ($Q_C$ = 13.81, d.f.=5).
Accordingly, the model structure chosen as the framework for
continued linear modeling was one in which effects for race
and time were fit within areas. The area reference
parameters are the predicted mean functions for 1973. Table
6.4.6a contains the specification matrix and Table 6.4.6b
contains the resulting parameter estimates and standard
Table 6.4.5

Hypotheses and Resulting Test Statistics Concerning Mean
Colds in 1973, 1974 and 1975 for Analyses of Both Sexes

Hypotheses

1. No difference between races for averages over area x time
2. No variation among areas for averages over time x race
3. No variation among times for averages over race x area
4. Homogeneity across time for differences among areas for
   averages across race (i.e. no area x time interaction)
5. Homogeneity across area for differences between races for
   averages across time
6. Homogeneity across time for differences between races for
   averages across area
7. No race x time x area interaction

                   Males                Females
Hypothesis     Qc     P-value       Qc     P-value      DF

    1.       31.35     .000       28.05     .000         1
    2.       14.52     .013       67.75     .000         5
    3.        5.80     .055        0.84     .656         2
    4.       50.86     .000       35.43     .000        10
    5.       13.81     .017       12.83     .025         5
    6.        0.05     .972        1.55     .461         2
    7.       10.52     .396        6.35     .785        10
Table 6.4.6a

Specification Matrix for Model X2 for Mean Colds in
1973, 1974 and 1975 for Males and Females

Specification Matrix X2

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1

Parameter Interpretation

$\beta_1, \beta_5, \beta_9, \beta_{13}, \beta_{17}, \beta_{21}$: Predicted value for 1973 for
    Charlotte, Birmingham, NYC, Utah, California I & II
$\beta_2, \beta_6, \beta_{10}, \beta_{14}, \beta_{18}, \beta_{22}$: Incremental effect for 'other' race
    for Charlotte, Birmingham, NYC, Utah, California I & II
$\beta_3, \beta_7, \beta_{11}, \beta_{15}, \beta_{19}, \beta_{23}$: Incremental effect for 1974 for
    Charlotte, Birmingham, NYC, Utah, California I & II
$\beta_4, \beta_8, \beta_{12}, \beta_{16}, \beta_{20}, \beta_{24}$: Incremental effect for 1975 for
    Charlotte, Birmingham, NYC, Utah, California I & II
Table 6.4.6b

Estimates and Standard Errors for Model X2 for Mean
Colds in 1973, 1974 and 1975 for Males and Females

Parameter          Males             Females

$\beta_1$        .803 ± .046      1.082 ± .049
$\beta_2$       -.115 ± .058      -.028 ± .068
$\beta_3$        .006 ± .054       .048 ± .055
$\beta_4$       -.193 ± .059      -.141 ± .060
$\beta_5$        .626 ± .032       .932 ± .037
$\beta_6$       -.218 ± .036      -.079 ± .042
$\beta_7$        .201 ± .033      1.077 ± .037
$\beta_8$        .127 ± .034      -.005 ± .039
$\beta_9$        .560 ± .066       .733 ± .072
$\beta_{10}$    -.090 ± .071      -.113 ± .098
$\beta_{11}$    -.005 ± .071      -.035 ± .079
$\beta_{12}$    -.001 ± .077       .038 ± .083
$\beta_{13}$     .655 ± .038       .968 ± .044
$\beta_{14}$    -.300 ± .075      -.290 ± .090
$\beta_{15}$     .146 ± .045       .001 ± .050
$\beta_{16}$     .207 ± .049       .204 ± .053
$\beta_{17}$     .595 ± .032       .810 ± .038
$\beta_{18}$    -.011 ± .068      -.196 ± .061
$\beta_{19}$     .061 ± .038       .016 ± .041
$\beta_{20}$     .101 ± .040       .054 ± .046
$\beta_{21}$     .754 ± .041       .917 ± .047
$\beta_{22}$    -.364 ± .095      -.342 ± .094
$\beta_{23}$     .007 ± .045       .052 ± .052
$\beta_{24}$     .034 ± .051       .115 ± .064
errors. The goodness-of-fit statistic is QW = 10.77
(d.f.=12), and is indicative of an adequate linear
description of the mean cold estimates.
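The block structure of the specification matrix for model X2 can be made explicit with a short sketch. The helper name `block_diagonal_design` is hypothetical; the 6 x 4 block is read off Table 6.4.6a, where each area contributes a reference value, a race effect, a 1974 effect and a 1975 effect, and rows are ordered White 1973 to 1975 followed by Other 1973 to 1975.

```python
def block_diagonal_design(n_blocks, block):
    """Place the same design block on the diagonal n_blocks times."""
    n_rows, n_cols = len(block), len(block[0])
    X = []
    for a in range(n_blocks):
        for r in range(n_rows):
            row = [0] * (n_blocks * n_cols)
            row[a * n_cols:(a + 1) * n_cols] = block[r]
            X.append(row)
    return X

# Columns within an area: reference (1973), 'other' race, 1974, 1975
area_block = [
    [1, 0, 0, 0],  # White, 1973
    [1, 0, 1, 0],  # White, 1974
    [1, 0, 0, 1],  # White, 1975
    [1, 1, 0, 0],  # Other, 1973
    [1, 1, 1, 0],  # Other, 1974
    [1, 1, 0, 1],  # Other, 1975
]

X2 = block_diagonal_design(6, area_block)  # six areas
residual_df = len(X2) - len(X2[0])         # functions minus parameters
```

The difference between the number of rows and the number of columns gives the 12 degrees of freedom quoted for the goodness-of-fit statistic.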
The modeling process continues with effort towards
model simplification so that succinct linear
parameterizations of the estimates can be determined for
descriptive purposes. Further stages consist of testing
linear hypotheses concerning parameter estimates and fitting
the reduced model implied by the results of such hypothesis
testing. Attention was first directed at the parameters
reflecting time effects. An intermediate model X3 was
fitted which included a reduction in the number of
parameters for time from twelve to five. Eliminated were
the 1974 effects for Charlotte, NYC, California I and
California II, as well as the 1975 effects for NYC and
California I. The 1974 parameters for Birmingham and Utah
were combined, as were the 1975 parameters for Birmingham
and California I.
Subsequent effort was aimed at smoothing race
parameters and then area parameters. Race effects for Utah
and California I proved to be non-significant, as was
predicted from their estimated values for model X2. The
parameters for Birmingham, Utah and California II were
smoothed into one, leaving the incremental effect for
Charlotte as the only 'stand-alone' race parameter. The
final reduced model reflected area parameter smoothing as
the last step in the model-fitting process. Charlotte and
California II were combined, and NYC and California I were
also combined. This ten parameter model is displayed in
Table 6.4.7a, along with the parameter estimates and their
standard errors. The goodness-of-fit statistic for this model is
$Q_W$ = 20.05 (d.f.=26). The p-value for $Q_W$ is .79.
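The quoted p-value can be checked without statistical software when the degrees of freedom are even, because the chi-square upper tail then has a closed form (a finite Poisson sum). The function name below is hypothetical and the sketch covers even d.f. only.

```python
import math

def chi2_sf_even(q, df):
    """Upper-tail probability P(chi-square_df > q) for even df.

    Uses the identity P = exp(-q/2) * sum_{k=0}^{df/2 - 1} (q/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = q / 2.0
    term = 1.0   # k = 0 term
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

p_final_males = chi2_sf_even(20.05, 26)
```

For the values above the result rounds to .79, in line with the text.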
The analysis of the females proceeded in a similar
fashion to that for the males. Table 6.4.4 also contains
the estimated Yi for the females in the respective (race x
area) subpopulations. A comparison with the males shows
that the females consistently reported more colds. No other
trend is readily apparent from comparing the table of
estimates. A preliminary investigation of the sources of
variation among these estimates was pursued by fitting the
identity model to the female estimates and utilizing
hypothesis tests concerning the resulting parameter
estimates. These tests revealed no substantial differences
from the results obtained in the analysis for the males.
Table 6.4.5 contains the results of these hypothesis tests.
The significant two-way interactions are (race x area) and
(time x area). The three-way interaction is non-significant.
The only result that differs from those for the males is that
the average effect for time is clearly non-significant, where
it was borderline non-significant for males ($\alpha = .05$ level
of significance). This test has limited meaning, however, in
the face of a very substantial interaction of time with area.
Consequently, the same 24 parameter reduced model X2
Table 6.4.7a

Final Model for Mean Colds (for males) in
1973, 1974 and 1975 by Area and Race

Specification Matrix $X_M$

1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0
1 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 1 0
0 1 0 0 1 0 0 0 0 0
0 1 0 0 1 1 0 0 0 0
0 1 0 0 1 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0 1
0 1 0 0 1 0 0 0 0 0
0 1 0 0 1 1 0 0 0 0
0 1 0 0 1 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0

                                                                      Estimates and
Parameter Interpretation                                              Standard Errors

$\beta_1$    : Predicted value for 1973 for Charlotte and Cal II        .773 ± .023
$\beta_2$    : Predicted value for 1973 for Birmingham and Utah         .641 ± .023
$\beta_3$    : Predicted value for 1973 for NYC and California I        .564 ± .021
$\beta_4$    : Incremental effect for race for Charlotte               -.089 ± .052
$\beta_5$    : Incremental effect for race for Birm, Utah and Cal II   -.247 ± .030
$\beta_6$    : Incremental effect for 1974 for Birmingham and Utah      .184 ± .025
$\beta_7$    : Incremental effect for 1974 for California I             .084 ± .032
$\beta_8$    : Incremental effect for 1975 for Charlotte               -.178 ± .045
$\beta_9$    : Incremental effect for 1975 for Birmingham and Cal I     .122 ± .023
$\beta_{10}$ : Incremental effect for 1975 for Utah                     .222 ± .043

$Q_W$ = 20.05    p-value = .789    d.f. = 26
Table 6.4.7b

Final Model for Mean Colds (for females) in
1973, 1974 and 1975 by Area and Race

Specification Matrix $X_F$

1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0
0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 0
0 1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 1 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 1 0 0 0 0
0 0 1 0 1 0 0 0 0
0 0 1 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 1
0 1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0 1

                                                                        Estimates and
Parameter Interpretation                                                Standard Errors

$\beta_1$ : Predicted value for 1973 for Charlotte                       1.096 ± .036
$\beta_2$ : Predicted value for 1973 for Birm, Utah and Cal II            .925 ± .018
$\beta_3$ : Predicted value for 1973 for NYC and California I             .795 ± .024
$\beta_4$ : Incremental effect for race for Utah and California II       -.287 ± .063
$\beta_5$ : Incremental effect for race for California I                 -.153 ± .059
$\beta_6$ : Incremental effect for 1974 for Birmingham                    .099 ± .027
$\beta_7$ : Incremental effect for 1975 for Charlotte                    -.168 ± .051
$\beta_8$ : Incremental effect for 1975 for Utah                          .226 ± .043
$\beta_9$ : Incremental effect for 1975 for California II                 .083 ± .051

$Q_W$ = 30.54    p-value = .29    d.f. = 27
was fitted for females as was for males. There are six
predicted reference values for areas, six race effects
within areas, and also an effect for 1974 and 1975 within
each area. The resulting parameter estimates are displayed
in Table 6.4.6, beside the analogous parameter estimates for
the males. A cursory examination shows that there are
similarities in the estimates, at least in direction if not
always in magnitude.
The process of hypothesis testing and fitting reduced
models led to the final model displayed in Table 6.4.7b. It
includes three predicted reference values for 1973: one for
Charlotte alone, one for Birmingham, Utah and California II
combined, and one for NYC and California I combined. The only
remaining time effect for 1974 is that for Birmingham, and the
1975 time effects in the model are those for Charlotte, Utah,
and California II.
6.5 Analysis of the Proportion of those Reporting Asthma in
1973, 1974, and 1975 with Multivariate Ratio Estimation
Table 6.5.1 contains estimates of the proportions of
CHESS subjects who reported incidents of asthma in the years
1973, 1974, and 1975. These estimates are from the entire
dataset, not just from those subjects having six points, who
were the focus of the analysis in the preceding section.
Consequently, much of the data for this analysis will be
incomplete. The table includes tabulations of the number of
observed and missing values by each (area x sex)
subpopulation. More than half of the data values are missing
for the estimation
Table 6.5.1

Proportion of CHESS Subjects Reporting Incidents
of Asthma in 1973, 1974, and 1975

Area            Sex       Mean       N    Missing

                        1973
Charlotte       Male      .06      648       820
Charlotte       Female    .04      672       765
Birmingham      Male      .07     1553      2168
Birmingham      Female    .05     1453      1921
NYC             Male      .05      297       704
NYC             Female    .05      304       628
Utah            Male      .05      801       662
Utah            Female    .04      811       600
California I    Male      .09      990       660
California I    Female    .05      917       644
California II   Male      .09      730       546
California II   Female    .05      630       417
TOTAL                     .06     9806     10535

                        1974
Charlotte       Male      .06      633       835
Charlotte       Female    .05      646       791
Birmingham      Male      .06     1958      1763
Birmingham      Female    .05     1869      1505
NYC             Male      .06      547       454
NYC             Female    .04      499       433
Utah            Male      .05      791       672
Utah            Female    .03      769       642
California I    Male      .07      940       710
California I    Female    .06      867       694
California II   Male      .10      674       602
California II   Female    .06      569       478
TOTAL                     .06    10762      9579

                        1975
Charlotte       Male      .07      858       610
Charlotte       Female    .06      870       567
Birmingham      Male      .08     2004      1717
Birmingham      Female    .07     1950      1424
NYC             Male      .06      510       491
NYC             Female    .05      465       467
Utah            Male      .06      796       667
Utah            Female    .05      760       651
California I    Male      .11     1061       589
California I    Female    .07      970       591
California II   Male      .11      718       558
California II   Female    .06      598       449
TOTAL                     .07    11560      8781
of 1+ asthma in 1973. Individually, missing data
percentages range from forty percent for California II
females to seventy percent for NYC males. This may be
considered entirely too much missing data from any type of
conservative standpoint. However, the sample sizes on which
the estimates are based are quite large, ranging from 297 to
1958, and this much existing data may serve to offset the
disadvantages of so much incomplete data. The only striking
trend among the estimates would appear to be that males
consistently report more asthma than females. This is in
contrast to previous findings for colds, where females
consistently reported more incidents than males. These
estimates and their standard errors are also reported in
Table 6.5.1.
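The missing-data percentages quoted above follow directly from the 1973 counts in Table 6.5.1. The short check below uses only those tabled counts; the variable and function names are illustrative.

```python
def missing_fraction(observed, missing):
    """Share of a subpopulation's values that are missing."""
    return missing / (observed + missing)

# (observed N, missing) for 1973, taken from Table 6.5.1
counts_1973 = {
    ("California II", "Female"): (630, 417),
    ("NYC", "Male"): (297, 704),
    ("TOTAL", "All"): (9806, 10535),
}

fractions_1973 = {key: missing_fraction(*pair) for key, pair in counts_1973.items()}
```

The two subpopulation values are about .40 and .70, and the overall 1973 fraction is just above one half, as stated in the text.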
Let $y_i = (y_{i1}, y_{i2}, y_{i3})'$ represent the multivariate
ratio estimator of the proportion reporting 1+ asthma in
1973, 1974, and 1975 (actually this is the mean estimate of
the (0,1) indicator variable indicating the presence or
absence of asthma symptoms in 1973, 1974, and 1975),
$i = 1, 2, \ldots, 12$, where $i=1$ indicates Charlotte males, $i=2$
indicates Charlotte females, and so on up to $i=12$ for
California II females. $y_{ik}$ is calculated via expression
(6.4.1) of the previous section, and the covariance matrix
$V_{y_i}$ is calculated directly from (6.4.3). Consequently,
forming $y = (y_1', y_2', \ldots, y_{12}')'$, the variation among the
estimates of the proportions with 1+ asthma can be modeled
with

$$E_A\{y\} = \mu = X\beta$$

where $\mu$ is the expected value of $y$, $X$ is the specification
matrix of interest and $\beta$ is the unknown parameter vector.
The estimated covariance matrix for $y$ is written $V_y$ and is
the block diagonal matrix with the i-th block taken from
expression (6.4.3). The first model of interest is the
identity model, which can be formally stated:

$$E_A\{y\} = X\beta = I\beta = \beta$$

Linear hypotheses concerning the elements of $\beta$ were
subsequently tested via Wald statistics. The results of
such testing are displayed in Table 6.5.2. There is no
three-way interaction, and the only significant ($\alpha = .05$ level
of significance) two-way interaction is that of area and
sex. $Q_C$ for this test is 12.17 (d.f.=5), with a p-value
of .03.
These results indicate that a promising model
structure to investigate is the one in which predicted
reference values are used for each area, incremental effects
for females are used for each area, and effects for 1974 and
1975 apply to all. This model can be stated:

$$E_A\{y\} = \mu = X_2\beta$$

and the resulting parameter estimates and standard errors
are displayed in Table 6.5.3, along with the specification
matrix. The Wald goodness-of-fit statistic for this model
is $Q_W$ = 14.04 (d.f.=22), which has a p-value of .90.
The table also includes the results of linear hypotheses
concerning the elements of the parameter vector $\beta$. The test
Table 6.5.2

Hypotheses and Resulting Test Statistics Concerning
Proportions of 1+ Asthma Reported in 1973, 1974 and 1975

Hypothesis

1. No difference between sexes for averages over area*time
2. No variation among areas for averages over sex*time
3. No variation among times for averages over sex*area
4. Homogeneity across areas of differences between sexes for
   averages across time
5. Homogeneity across time of differences between sexes for
   averages across area
6. Homogeneity across area for differences between times for
   averages across sex
7. No sex*time*area interaction

Hypothesis      Qc      P-value     df

    1.        31.07      .0000       1
    2.        32.77      .0000       5
    3.        24.42      .0000       2
    4.        12.17      .0325       5
    5.         0.27      .8700       2
    6.         8.55      .5753      10
    7.         5.12      .8830      10
Table 6.5.3

Specification Matrix, Estimated Parameters and Standard
Errors for Model X2 for the Probability of
Reporting Asthma in 1973, 1974 and 1975

Specification Matrix

1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 1 0
1 1 0 0 0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 1
0 0 1 1 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 1 0
0 0 1 1 0 0 0 0 0 0 0 0 0 1
0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 1 0
0 0 0 0 1 0 0 0 0 0 0 0 0 1
0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 0 1 0
0 0 0 0 1 1 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 1 0
0 0 0 0 0 0 1 0 0 0 0 0 0 1
0 0 0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 1 0
0 0 0 0 0 0 1 1 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 1 0
0 0 0 0 0 0 0 0 1 0 0 0 0 1
0 0 0 0 0 0 0 0 1 1 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 0 0 1
0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 0 0 0 0 0 0 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 0 0 0 0 1 1 0 1

                 Estimates and
Parameter        Standard Errors

$\beta_1$         .0589 ± .0062
$\beta_2$        -.0123 ± .0083
$\beta_3$         .0620 ± .0044
$\beta_4$        -.0111 ± .0054
$\beta_5$         .0525 ± .0074
$\beta_6$        -.0092 ± .0097
$\beta_7$         .0506 ± .0057
$\beta_8$        -.0110 ± .0080
$\beta_9$         .0854 ± .0071
$\beta_{10}$     -.0298 ± .0090
$\beta_{11}$      .0968 ± .0085
$\beta_{12}$     -.0466 ± .0110
$\beta_{13}$     -.0026 ± .0027
$\beta_{14}$      .0139 ± .0030

Parameter Interpretation

$\beta_1, \beta_3, \beta_5, \beta_7, \beta_9, \beta_{11}$: Predicted values for 1973 for
    Charlotte, Birmingham, NYC, Utah, California I and II
$\beta_2, \beta_4, \beta_6, \beta_8, \beta_{10}, \beta_{12}$: Incremental effect for females
    for Charlotte, Birmingham, NYC, Utah, California I and II
$\beta_{13}$: Incremental effect for 1974
$\beta_{14}$: Incremental effect for 1975

Linear Hypothesis                                      Qc     P-value    df

$H_1$: $\beta_2 = \beta_4 = \beta_6 = \beta_8$         .059     .962      3
$H_2$: $\beta_{10} = \beta_{12}$                      1.101     .294      1
$H_3$: $\beta_{13} = 0$                                .915     .339      1
investigating whether the time effect for 1974 is null,
$H_0\colon \beta_{13} = 0$, can be assessed with the Wald statistic

$$Q_C = b'C'(C V_b C')^{-1} C b$$

where $b$ is the estimate for $\beta$, $V_b$ is its
estimated covariance matrix, and $C$ is the contrast matrix

$$C = [\,0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\,]$$

For the hypothesis $H_3$, $Q_C$ = .915 (1 d.f.) and its p-value is
.34. Additional hypotheses examined were

$$H_1\colon \beta_2 = \beta_4 = \beta_6 = \beta_8$$

or the equivalence of the sex effects for Charlotte,
Birmingham, NYC and Utah, and

$$H_2\colon \beta_{10} = \beta_{12}$$

the equivalence of the sex effects for California I and
California II. Neither of these hypotheses is contradicted, and
the model X3 incorporating the implied parameter smoothing
is displayed in Table 6.5.4.
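For a contrast matrix with a single row, the Wald statistic reduces to a squared estimate divided by its variance, which allows a rough check of the test of the 1974 effect from the rounded values in Table 6.5.3. The sketch below makes one loud assumption: the off-diagonal elements of the covariance matrix are not reported in the text, so a diagonal matrix of squared standard errors is used as a placeholder. That placeholder does not affect this particular contrast, which involves one parameter only. Function names are hypothetical.

```python
import math

def wald_single_contrast(c, b, V):
    """Wald statistic Q = (c b)^2 / (c V c') for a one-row contrast c."""
    cb = sum(ci * bi for ci, bi in zip(c, b))
    n = len(c)
    cVc = sum(c[i] * V[i][j] * c[j] for i in range(n) for j in range(n))
    return cb * cb / cVc

def chi2_sf_1df(q):
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(q / 2.0))

# Rounded estimates and standard errors from Table 6.5.3
b = [.0589, -.0123, .0620, -.0111, .0525, -.0092, .0506,
     -.0110, .0854, -.0298, .0968, -.0466, -.0026, .0139]
se = [.0062, .0083, .0044, .0054, .0074, .0097, .0057,
      .0080, .0071, .0090, .0085, .0110, .0027, .0030]

# Placeholder covariance: diagonal only (covariances are not reported)
V = [[se[i] ** 2 if i == j else 0.0 for j in range(14)] for i in range(14)]

c = [0] * 14
c[12] = 1  # selects the thirteenth parameter, the 1974 effect

q_1974 = wald_single_contrast(c, b, V)
p_1974 = chi2_sf_1df(q_1974)
```

Because the inputs are rounded to four decimals, the result only approximates the reported $Q_C$ = .915 and p-value of .34.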
This nine parameter model includes the six area
predicted reference values, one incremental effect for sex
for California I and California II, another incremental sex
effect for the other four areas, and an overall incremental
effect for 1975. The goodness-of-fit statistic for the
model is $Q_W$ = 16.45 (d.f.=27) and its p-value is .94. The
linear hypotheses concerning the area reference parameters
can be stated:

$$H_4\colon \beta_1 = \beta_2,\quad \beta_3 = \beta_4,\quad \beta_5 = \beta_6$$

and were tested with a C matrix of the form
Table 6.5.4

Specification Matrix, Estimated Parameters and Standard
Errors for Model X3 for the Probability of
Reporting Asthma in 1973, 1974 and 1975

Specification Matrix

1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 1
1 0 0 0 0 0 1 0 0
1 0 0 0 0 0 1 0 0
1 0 0 0 0 0 1 0 1
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 1
0 1 0 0 0 0 1 0 0
0 1 0 0 0 0 1 0 0
0 1 0 0 0 0 1 0 1
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 1
0 0 1 0 0 0 1 0 0
0 0 1 0 0 0 1 0 0
0 0 1 0 0 0 1 0 1
0 0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0 1
0 0 0 1 0 0 1 0 0
0 0 0 1 0 0 1 0 0
0 0 0 1 0 0 1 0 1
0 0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0 1
0 0 0 0 1 0 0 1 0
0 0 0 0 1 0 0 1 0
0 0 0 0 1 0 0 1 1
0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 1
0 0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 1 1

                                                               Estimates and
Parameter Interpretation                                       Standard Errors

$\beta_1$ : Predicted value for 1973 for Charlotte              .0570 ± .0046
$\beta_2$ : Predicted value for 1973 for Birmingham             .0605 ± .0034
$\beta_3$ : Predicted value for 1973 for NYC                    .0518 ± .0053
$\beta_4$ : Predicted value for 1973 for Utah                   .0492 ± .0044
$\beta_5$ : Predicted value for 1973 for Cal I                  .0881 ± .0061
$\beta_6$ : Predicted value for 1973 for Cal II                 .0896 ± .0067
$\beta_7$ : Incremental effect for females for Utah,           -.0111 ± .0037
            Birmingham, NYC and Charlotte
$\beta_8$ : Incr. effect for females for Cal I & II            -.0366 ± .0069
$\beta_9$ : Incremental effect for 1975                         .0156 ± .0024

$Q_W$ = 16.45    p-value = .9438    d.f. = 27

Linear Hypotheses                                             Q     p-value   d.f.

$\beta_1 = \beta_2$, $\beta_3 = \beta_4$, $\beta_5 = \beta_6$   .73     .867      3

$$C = \begin{bmatrix}
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0
\end{bmatrix}$$

and found to be reasonable ($Q_C$ = .73, d.f.=3, p-value=.867).
The constraint imposed by $H_4$ was incorporated into a
final model. This can be stated as:

$$E_A\{y\} = X_4\beta$$

where $X_4$ is the (36 x 6) specification matrix displayed in
Table 6.5.5, and $\beta$ is the (6 x 1) parameter vector, the
estimates for which are also shown in Table 6.5.5, along
with their standard errors. The Wald goodness-of-fit
statistic for this model is $Q_W$ = 17.18 (d.f. = 30,
p-value = .97). There is a noticeable difference in the values
of the area predicted reference values for 1+ asthma in
1973. The value of .0887 for the two California areas is
much higher than the other estimated values of .0594 and
.0502. The incremental effects for females are both
negative in direction, with the effect for California being
much larger in magnitude than that for the other areas and
thus serving to bring the female estimates of 1+ asthma in
California more in line with those in the other areas. The
remaining parameter is the average incremental effect for
1975. Predicted and observed proportions are displayed in
Table 6.5.5

Specification Matrix, Estimated Parameters and Standard
Errors for Final Model X4 for the Probability of
Reporting Asthma in 1973, 1974 and 1975

Specification Matrix

1 0 0 0 0 0
1 0 0 0 0 0
1 0 0 0 0 1
1 0 0 1 0 0
1 0 0 1 0 0
1 0 0 1 0 1
1 0 0 0 0 0
1 0 0 0 0 0
1 0 0 0 0 1
1 0 0 1 0 0
1 0 0 1 0 0
1 0 0 1 0 1
0 1 0 0 0 0
0 1 0 0 0 0
0 1 0 0 0 1
0 1 0 1 0 0
0 1 0 1 0 0
0 1 0 1 0 1
0 1 0 0 0 0
0 1 0 0 0 0
0 1 0 0 0 1
0 1 0 1 0 0
0 1 0 1 0 0
0 1 0 1 0 1
0 0 1 0 0 0
0 0 1 0 0 0
0 0 1 0 0 1
0 0 1 0 1 0
0 0 1 0 1 0
0 0 1 0 1 1
0 0 1 0 0 0
0 0 1 0 0 0
0 0 1 0 0 1
0 0 1 0 1 0
0 0 1 0 1 0
0 0 1 0 1 1

                                                             Estimates and
Parameter Interpretation                                     Standard Errors

$\beta_1$ : Predicted value for Birm and Char                 .0594 ± .0031
$\beta_2$ : Predicted value for NYC and Utah                  .0502 ± .0036
$\beta_3$ : Predicted value for California I and II           .0887 ± .0054
$\beta_4$ : Incremental effect for females                   -.0110 ± .0037
$\beta_5$ : Incr. effect for females for Cal I & II          -.0366 ± .0069
$\beta_6$ : Incremental effect for 1975                       .0156 ± .0024

$Q_W$ = 17.18    P-value = .9704    d.f. = 30
Table 6.5.6, along with the corresponding standard errors.
When this model is compared to the model for 1+ asthma
discussed in Chapter V and illustrated in Table 6.5.5,
certain similarities emerge. The parameters for California
I and California II can be combined, a situation that has
not occurred for any of the analyses for colds. Also, there
is no inclusion of time effects for 1974 in the model for
the complete data analysis, and the 1975 effect included is
averaged over all areas as in the present analysis, although
it is restricted to males. Thus, even though the current
analysis is based on roughly twice as many observations per
year's estimates, there are similar patterns in each
analysis.
Table 6.5.6

Observed and Predicted Proportions of Reporting
Asthma in 1973, 1974 and 1975

                               Observed                  Predicted
Area            Sex     1973    1974    1975      1973    1974    1975

Charlotte        M     .0556   .0600   .0734     .0594   .0594   .0750
Charlotte        F     .0432   .0480   .0609     .0484   .0484   .0640
Birmingham       M     .0663   .0567   .0753     .0594   .0594   .0750
Birmingham       F     .0496   .0471   .0682     .0484   .0484   .0640
NYC              M     .0539   .0567   .0588     .0502   .0502   .0658
NYC              F     .0493   .0441   .0495     .0392   .0392   .0548
Utah             M     .0499   .0506   .0628     .0502   .0502   .0658
Utah             F     .0444   .0338   .0526     .0392   .0392   .0548
California I     M     .0859   .0745   .0110     .0887   .0887   .1043
California I     F     .0534   .0554   .0701     .0522   .0522   .0677
California II    M     .0932   .1009   .0110     .0887   .0887   .1043
California II    F     .0492   .0580   .0585     .0522   .0522   .0677

Standard Errors

                               Observed                  Predicted
Area            Sex     1973    1974    1975      1973    1974    1975

Charlotte        M     .0090   .0094   .0089     .0031   .0031   .0034
Charlotte        F     .0078   .0084   .0081     .0029   .0029   .0033
Birmingham       M     .0063   .0052   .0059     .0031   .0031   .0034
Birmingham       F     .0057   .0049   .0057     .0029   .0029   .0034
NYC              M     .0131   .0099   .0104     .0036   .0036   .0039
NYC              F     .0124   .0092   .0101     .0037   .0037   .0040
Utah             M     .0077   .0078   .0086     .0036   .0036   .0039
Utah             F     .0072   .0065   .0081     .0037   .0037   .0040
California I     M     .0089   .0086   .0096     .0054   .0054   .0056
California I     F     .0074   .0078   .0082     .0044   .0044   .0047
California II    M     .0108   .0116   .0117     .0054   .0054   .0056
California II    F     .0086   .0098   .0096     .0044   .0044   .0047
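The predicted proportions in Table 6.5.6 are sums of the relevant final-model parameters from Table 6.5.5. The sketch below rebuilds them from the rounded estimates; the dictionary layout and function name are illustrative, and small discrepancies in the fourth decimal (for example for California females) are to be expected from rounding.

```python
# Rounded parameter estimates from Table 6.5.5
REFERENCE = {
    "Charlotte": .0594, "Birmingham": .0594,
    "NYC": .0502, "Utah": .0502,
    "California I": .0887, "California II": .0887,
}
FEMALE_EFFECT_CALIFORNIA = -.0366
FEMALE_EFFECT_OTHER = -.0110
EFFECT_1975 = .0156

def predicted_proportion(area, sex, year):
    """Predicted proportion reporting asthma under the final model."""
    value = REFERENCE[area]
    if sex == "F":
        if area.startswith("California"):
            value += FEMALE_EFFECT_CALIFORNIA
        else:
            value += FEMALE_EFFECT_OTHER
    if year == 1975:  # 1973 and 1974 share the reference level
        value += EFFECT_1975
    return value

charlotte_m_1975 = predicted_proportion("Charlotte", "M", 1975)
nyc_f_1973 = predicted_proportion("NYC", "F", 1973)
cal2_m_1975 = predicted_proportion("California II", "M", 1975)
```

Since the model contains no 1974 effect, the 1973 and 1974 predictions coincide within every subpopulation, which is the pattern visible in the predicted columns of the table.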
CHAPTER VII
DISCUSSION
This dissertation has attempted to illustrate an
integrated approach to the analysis of categorical data,
especially when the dataset is large and the analytic
objectives ambiguous in terms of pre-stated hypotheses. The
need for variable selection is especially important when the
dataset at hand is large, as it can be physically and
computationally impossible to include all possible variables
in all desired analyses. The variable selection procedure
described in Chapter IV is very suitable for use with a
large dataset with a large number of candidate variables;
the extension of this procedure to include multivariate
randomization statistics to evaluate associations of
candidate variables with multiple response variables is
useful in a repeated measurements situation when there is a
desire to continue analysis efforts with multivariate linear
models methods. The variable selection performed in Chapter
IV was strictly mechanical; certainly the analyst may have
substantive information which may make it reasonable to
include one or more explanatory variables in subsequent
analyses regardless of the results of any variable selection
scheme. For example, even though the two level variable
RACE2 emerged as a variable to 'include' from the variable
selection for the outcome variable 2+ colds, one may have
thought it more prudent to use the three level variable
RACE3 to investigate race effects. However, for those
situations where one has no a priori ideas on which
variables to include, the scheme presented in Chapter IV
would appear to be very reasonable. In linear models
analysis of categorical data, as has been pointed out in
previous chapters, one is limited in the number of
independent variables one can model at one time, so a good
variable selection process is perhaps more important here
than in other data analysis situations where the outcome
measures are not categorical.
The second phase of analysis is the development of
linear models to describe the variation among the estimates
of interest for the subpopulations formed on the basis of
the independent variables selected in the first phase.
Weighted least squares regression is applied to produce
parameter estimates for the models under consideration. The
first model applied is the identity model, which provides
the framework for assessing linear hypotheses concerning
potential sources of variation. Once these preliminary
sources of variation are assessed, one can select an
appropriate model structure for further modeling efforts.
Once such an intermediate model is fitted and assessed as
appropriate, additional model reduction can continue through
the process of hypothesis testing and implied model fitting.
This thesis demonstrated a few directions that linear
models can take. One of these is producing parameter
estimates for one model, and then subsequently fitting
another model to those parameter estimates, which was done
in the section on supplemental margins. The section on
supplemental margins also demonstrated that estimates and
covariance matrices from analyses performed on separate
subpopulations can be put together so that another dimension
of modeling can be pursued. The usefulness of residual
analysis in categorical data analysis is also illustrated in
some of the analysis examples.
It is important to note that all of the analyses
performed in this work served to provide an appropriate
description of the variation in the CHESS dataset. The
linear models produced were intended as a descriptive tool
rather than an inferential device. Thus, the multiple
hypothesis tests performed at all the modeling stages also
need to be seen in the context of a descriptive analysis
intended to provide a good description of the data
themselves rather than as tests of statistical inference.
The significance levels adhered to throughout the analysis
chapters were used as guidelines for decisions regarding the
directions of modeling efforts. Resulting models should
thus be thought of as applying to this dataset only; there
is no basis for inferring results to any other population.
Thus, in terms of the discussion at the beginning of Chapter
II, this is strictly a local population analysis.
The analysis results appear to indicate a few
consistent patterns which hold across the individual
analyses of the varying data subsets examined. Males
reported more asthma than females, while females
consistently reported more colds than males. There were
always area differences, although the nature of those
differences varied. Time was a source of variation as well,
although it usually manifested itself as an interaction with
either sex or area. The pollution index was not included in
the linear models efforts, as it did not prove to be
associated with either response variables or explanatory
variables when randomization tests were performed in the
variable selection process. Thus, the possible effects due
to pollution may be part of the reason for area differences.
The missing data strategies applied to this dataset
were functional and useful, but mechanically difficult.
Supplemental margins has its advantages over multivariate
ratio estimation in that it results in weighted estimates
with reduced variances, and also allows one to get
goodness-of-fit tests along the way, but it was very
time-consuming to implement and involved several different
stages. Even more stages are required if one feels the
necessity of element-wise deletion and case-wise deletion
discussed in
the supplemental margins section. Multivariate ratio
estimation is easier to apply but its requirement of
indicator functions to denote whether a value is present or
not for a particular response measure reduces by one half
the maximum number of functions that one could include in
the function vector being modeled. Thus, the analysis of
mean colds in section 6.4 had to be split up into separate
analyses for the sexes. As pointed out in Chapter VI,
multivariate ratio estimation was entirely adequate for use
as a missing data adjuster in this dataset, as the estimates
produced by the two procedures for section 6.3 were nearly
identical, the only real difference being that the variances
were somewhat lower for supplemental margins. However, that
might not be the case in a more moderately-sized dataset.
Additional work is needed to provide software to
automate these missing data adjustment procedures for
categorical data, and to increase their applicability in
terms of numbers of functions they are able to handle. It
would be interesting to research more fully the implications
of using supplemental margins versus multivariate ratio
estimation; i.e., is there a point in the sample sizes
involved where the reduced precision offered by multivariate
ratio estimation becomes a problem? It would have been
computationally impossible to duplicate this analysis if
there had been four time points involved instead of three.
Easily implemented software which allows the presence of more
functions is thus also required from the point of view of
repeated measurements.
BIBLIOGRAPHY
Bates, P. V. (1967). Air pollution and chronic bronchitis. Archives of Environmental Health, 14, 220.
Bhapkar, V. P. (1966). A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 61, 228-235.
Buechley, R. W., Riggan, W. B., Hasselblad, V. and VanBruggen, J. B. (1973). SO2 levels and perturbations in mortality. A Study in the New York-New Jersey Metropolis. Archives of Environmental Health, 27, 134.
Chapman, R. S., Shy, C. M., Finklea, J. F., House, D. E., Goldberg, H. E. and Hayes, C. G. (1973). Chronic respiratory disease in military inductees and parents of schoolchildren. Archives of Environmental Health, 27, 138.
Clarke, S. H. and Koch, G. G. (1976). The effect of income and other factors on whether criminal defendants go to prison. The Law and Society Review, 11, 57-92.
Cochran, W. G. (1954). Some methods of strengthening the common chi-square test. Biometrics, 10, 417-451.
Cook, N. R. and Ware, J. H. (1980). Design and analysis methods for longitudinal research. Annual Review of Public Health, 4, 1-24.
Cornfield, J. (1944). On samples from finite populations. Journal of the American Statistical Association, 39, 136-239.
Dohan, F. C. (1961). Air pollutants and incidence of respiratory disease. Archives of Environmental Health, 3, 387.
Dohan, F. C. and Taylor, E. W. (1960). Air pollution and respiratory disease, a preliminary report. American Journal of Medical Science, 240, 337.
Douglas, J. and Waller, R. (1966). Air pollution and respiratory disease in children. British Journal of Preventive and Social Medicine, 20, 1-8.
Ferris, B. (1970). Effects of air pollution on school absences and differences in lung function in first and second graders in Berlin, New Hampshire, January 1966 to June 1967. American Review of Respiratory Disease, 102, 591-607.
Forthofer, R. N. and Koch, G. G. (1973). An analysis for compounded functions of categorical data. Biometrics, 29, 143-157.
Gleason, T. C. and Staelin, R. (1975). A proposal for handling missing data. Psychometrika, 40, 229-252.
Glasser, M., Greenberg, L. and Field, F. (1976). Mortality and morbidity during a period of high levels of air pollution. New York, November 23 to 25, 1965. Archives of Environmental Health, 15, 684.
Gould, A. L. (1980). A new approach to the analysis of clinical drug trials with withdrawals. Biometrics, 36, 721-727.
Greenberg, L., Field, F., Reed, J. I. and Erhardt, C. L. (1964). Asthma and temperature change. An epidemiological study in three large New York hospitals. Archives of Environmental Health, 8, 642.
Greenhouse, S. W. and Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 112.
Grizzle, J. E., Starmer, C. F. and Koch, G. G. (1969). The analysis of categorical data by linear models. Biometrics, 25, 489-504.
Hasselblad, V., Nelson, W. C. and Lowrimore, G. R. (1974). Analysis of Effects Data: Some Results and Problems. In Statistical and Mathematical Aspects of Pollution Problems, ed. John N. Pratt. Marcel Dekker, Inc.
Herman, Stewart W. (1977). The health costs of air pollution: A survey of studies published between 1967 and 1977. American Lung Association, 1740 Broadway, New York, N.Y. 1001~
Higgins, J. E. and Koch, G. G. (1977). Variable selection and generalized chi-squared analysis of categorical data applied to a large cross-sectional occupational health survey. International Statistical Review, 45, 51-62.
Hopkins, C. E. and Gross, A. J. (1971). A generalization of Cochran's procedure for the combining of r x c contingency tables. Statistica Neerlandica, 25, 57-62.
Huynh, H. and Feldt, L. S. (1976). Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics, 1, 69-82.
Kleinbaum, D. G. (1970). Estimation and hypothesis testing for generalized multivariate linear models. University of North Carolina Institute of Statistics Mimeo Series, No. 609.
Koch, G. G. (1969). A useful lemma for proving the equality of two matrices with applications to least squares type quadratic forms. Journal of the American Statistical Association, 64, 969-970.
Koch, G. G., Amara, I. A., Stokes, M. E. and Gillings, D. B. (1980). Some views on some parametric and nonparametric analysis for repeated measurements and selected bibliography. International Statistical Review, 48, 249-265.
Koch, G. G., Elashoff, J. and Amara, I. A. (1985). Repeated Measurement Studies, Design and Analysis. In Encyclopedia of Statistical Sciences, N. L. Johnson and S. Kotz (eds.), 457-472, Wiley, New York.
Koch, G. G. and Gillings, D. B. (1983). Inference, design based vs. model based. In Encyclopedia of Statistical Sciences 4, N. L. Johnson and S. Kotz (eds.), 84-88, Wiley, New York.
Koch, G. G., Gillings, D. B. and Stokes, M. E. (1980). Biostatistical implications of design, sampling and measurement to the analysis of health science data. Annual Review of Public Health, 1, 163-225.
Koch, G. G., Imrey, P. B. and Reinfurt, D. W. (1972). Linear model analysis of categorical data with incomplete response vectors. Biometrics, 28, 663-692.
Koch, G. G., Imrey, P. B., Singer, J. M., Atkinson, S. S. and Stokes, M. E. (1985). Analysis of Categorical Data. University of Montreal Press.
Koch, G. G., Johnson, W. and Tolley, D. (1972). A linear models approach to the analysis of survival and extent of disease in multidimensional contingency tables. Journal of the American Statistical Association, 72, 783-796.
Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H., Jr. and Lehnen, R. G. (1977). A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics, 33, 133-158.
Koch, G. G. and Reinfurt, D. W. (1974). An analysis of the relationship between driver injury and vehicle age for automobiles involved in North Carolina accidents during 1966-1970. Accident Analysis and Prevention, 6, 1-18.
Laird, N. and Ware, J. (1982). Random effects for longitudinal data. Biometrics, 38, 963-974.
Lambert, P. M. and Reid, D. D. (1970). Smoking, air pollution, and bronchitis in Britain. Lancet, 1, 853.
Landis, J. R., Heyman, E. R. and Koch, G. G. (1978). Average partial association in three-way contingency tables: a review and discussion of alternate tests. International Statistical Review, 46, 237-254.
Landis, J. R. and Koch, G. G. (1979). The analysis of categorical data in longitudinal studies of behavioral development. Chapter 9 in Longitudinal Research in the Study of Behavior and Development, edited by J. R. Nesselroade and P. B. Baltes. Academic Press, New York, 233-261.
Landis, J. R., Stanish, W. M., Freeman, J. L. and Koch, G. G. (1976). A computer program for the generalized chi-square analysis of categorical data using weighted least squares (GENCAT). Computer Programs in Biomedicine, 6, 196-231.
Lehnen, R. G. and Koch, G. G. (1974). The analysis of categorical data from repeated measurement research designs. Political Methodology, 1, 103-123.
Levy, D., Gent, M. and Newhouse, M. T. (1977). Relationship between acute respiratory disease and air pollution levels in an industrial city. American Review of Respiratory Disease, 116, 167.
Lunn, J. E., Knowelden, J. and Handyside, A. Patterns of respiratory illness in Sheffield schoolchildren. British Journal of Preventive and Social Medicine, 21, 7-16.
McCarroll, J. and Bradley, W. (1966). Excess mortality as an indicator of health effects of air pollution. American Journal of Public Health, 56, 1933.
Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.
Mantel, N. (1963). Chi-square tests with one degree of freedom; extensions of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690-700.
Martin, A. E. (1964). Mortality and morbidity statistics and air pollution. Proceedings of the Royal Society of Medicine, 57, 969.
Morrison, D. (1976). Multivariate Statistical Methods. McGraw-Hill Book Company, New York, 2nd Edition.
Mosher, W. E. (1970). An effect of continued exposure to air pollution on the incidence of chronic childhood allergic disease. American Journal of Public Health, 60, 891.
Mostardi, R. A., Woebkenberg, N. R., Ely, D. L., Conlon, M. M. and Atwood, G. (1981). The University of Akron study on air pollution and human health effects II. Effects on acute respiratory illness. Archives of Environmental Health, 36, 5.
Muller, K. E., Smith, J. C. and Shy, C. M. (1981). Relationship between air pollution and children's pulmonary function in six areas in the United States. Contract 68-02-2763 with EPA.
Neyman, J. (1949). Contributions to the theory of the chi-square test. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability. J. Neyman (ed.) University of California Press, Berkeley, 239-273.
Puri, M. L. and Sen, P. K. (1971). Non-parametric Methods in Multivariate Analysis. Wiley and Sons, New York.
Rao, M., Steiner, P., Qazi, Q., Padre, R., Allen, J. E. and Steiner, M. (1973). Relationship of air pollution to attack rate of asthma in children. Journal of Asthma Research, 11, 73.
Schrenk, H. H., Heimann, H., Clayton, G. D., Gafafer, W. and Wexler, H. (1949). Air pollution in Donora, Pennsylvania. Epidemiology of the unusual smog episode of October 1948, Public Health Bulletin 306, U.S. Government Printing Office, Washington, D.C.
Shy, C., Goldsmith, J., Hackney, J., Lebowitz, M. and Menzel, D. (1978). Health Effects of Air Pollution: Review for the American Lung Association. American Lung Association.
Shy, C. M., Hasselblad, V., Burton, R. M., Nelson, C. J. and Cohen, A. (1973). Air Pollution Effects on Ventilatory Function of U.S. Schoolchildren. Results of Studies in Cincinnati, Chattanooga and New York, Archives of Environmental Health, 27, 124.
Stanish, W. M. (1978). Adjustment for covariates in categorical variable selection and in multivariate partial association tests. Unpublished dissertation, UNC/Chapel Hill.
Stanish, W. M., Gillings, D. B. and Koch, G. G. (1978). An application of multivariate ratio methods for the analysis of a longitudinal clinical trial with missing data. Biometrics, 34, 305-317.
Status of the Community Health and Environmental Surveillance System (CHESS). (November, 1980). Report to the U.S. House of Representatives Committee on Science and Technology. EPA-600/1-80-033. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C. 20460.
Sultz, H. A., Feldman, J. G., Schlesinger, E. R. and Mosher, W. E. (1970). An effect of continued exposure to air pollution on the incidence of chronic childhood allergic disease. American Journal of Public Health, 60, 891.
Timm, N. (1975). Multivariate Methods with Applications in Education and Psychology. Brooks/Cole Publishing Company, Monterey, California.
Timm, N. (1980). Multivariate Analysis of Variance of Repeated Measurements. In P. R. Krishnaiah, ed. Handbook of Statistics, Vol. I. North Holland Publishing Company, 41-87.
Toyama, T. (1964). Air pollution and its health effects in Japan. Archives of Environmental Health, 8, 153.
Verma, M. P., Schilling, F. J. and Becker, N. H. (1969). Epidemiological study of illness absences in relation to air pollution. Archives of Environmental Health, 18, 536.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426-482.
Ware, J. H. (1985). Linear models for the analysis of longitudinal studies. The American Statistician, 39, 2.
Whittemore, A. S. and Korn, E. L. (1980). Asthma and air pollution in the Los Angeles area. American Journal of Public Health, 70, 7.
Yoshida, R., Motomiya, K., Saito, H. and Funabashi, S. (1976). Clinical and epidemiological studies on childhood asthma and air polluted areas. In Clinical Implications of Air Pollution Research, Asher J. Finkel and Ward C. Duel, eds. Publishing Science Group, Inc., Acton, Mass., 165-176.