THE APPLICATION OF AN INTEGRATED SET OF CATEGORICAL ANALYSIS METHODS TO A LARGE ENVIRONMENTAL DATASET

WITH REPEATED MEASURES AND PARTIALLY COMPLETE DATA

by

Maura Ellen Stokes

Department of Biostatistics
University of North Carolina at Chapel Hill

Institute of Statistics Mimeo Series No. 1807T

September 1986

THE APPLICATION OF AN INTEGRATED SET OF CATEGORICAL

ANALYSIS METHODS TO A LARGE ENVIRONMENTAL DATASET WITH REPEATED MEASURES AND PARTIALLY COMPLETE DATA

by

Maura Ellen Stokes

A Dissertation submitted to the faculty of The University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Public Health in the Department of Biostatistics

Chapel Hill

1986

ABSTRACT

MAURA ELLEN STOKES. The Application of an Integrated Set of Categorical Analysis Methods to a Large Environmental Dataset with Repeated Measures and Partially Complete Data (Under the direction of Gary Koch).

The usefulness of some recently developed categorical

data methodology is evaluated through its application to a

large dataset which pertains to environmental health.

Multivariate randomization test statistics are employed in a

variable selection strategy to evaluate the association

between demographic variables in the dataset and the

response variables pertaining to the prevalences of colds

and asthma in children. Subpopulations are formed on the

basis of the results of the variable selection and weighted

least squares methods are utilized to describe the variation

among the response estimates of interest. Principal

attention is given to subpopulations based on area, race and

sex. Various types of modeling techniques are illustrated,

including the use of residual analysis in assessing a given

model's appropriateness.

Analysis of the partially complete data is undertaken

with two different strategies. Multivariate ratio

estimation involves the calculation of multivariate ratio

estimates of the means of interest and a corresponding

covariance matrix. Supplemental margins is a linear models

strategy in which the complete and incomplete data are

combined by treating the incomplete observations as members

of distinct subpopulations. These methods as well as some

variations are applied to this dataset and their advantages

and disadvantages evaluated.

ACKNOWLEDGEMENTS

I gratefully acknowledge Gary Koch for his continued guidance, support and patience with me (although I still deny losing the auditron). I would also like to thank the members of my committee, Craig Turnbull, Keith Muller, Carl Shy and Kerry Lee for their efforts.

I would like to express my appreciation to my parents for making the idea of graduate school possible.

I would like to thank my brother Matt for his help in editing this dissertation.

I would like to express my appreciation to my friend, Lisa Lavange, for establishing that one can finish a dissertation expediently, as well as get married, have a child, and establish a career so that the pressure was off me.

I am grateful for the support of friends, co-workers, and the often unfortunate office mates during the many colorful phases of this entire process. I especially acknowledge the help it was to have company during the late night activities at the trailer, RTI, and SAS.

And lastly, here's to Kate Lavange for not having acquired the vocabulary yet to ask when it would be finished.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ................................................ ii

I.   REVIEW OF STUDIES CONCERNING AIR POLLUTION AND
     HEALTH AND DISCUSSION OF CHESS PULMONARY DATA .............. 1

     1.1 Overview of Air Pollution Studies ...................... 1
     1.2 CHESS Studies .......................................... 8
     1.3 Children's Pulmonary Function Data ..................... 9
     1.4 Pulmonary Function Data Analysis ....................... 11
     1.5 Data Structures for Categorical Data Analysis .......... 13
     1.6 Data Description ....................................... 16
     1.7 Overview of Research ................................... 19

II.  RANDOMIZATION TESTS METHODOLOGY ............................ 38

     2.1 Introduction ........................................... 38
     2.2 Research Design Implications ........................... 39
     2.3 Randomization Test Methods ............................. 41
         2.3.1 First Order Association .......................... 41
         2.3.2 Partial Association .............................. 45
         2.3.3 Average Partial Association Methodology .......... 48
         2.3.4 Mean Score Test .................................. 52
         2.3.5 Correlation Test ................................. 56
     2.4 Multivariate Randomization Statistics .................. 58
     2.5 Summary ................................................ 61

III. WEIGHTED LEAST SQUARES METHODOLOGY ......................... 63

     3.1 Introduction ........................................... 63
     3.2 Weighted Least Squares Methodology ..................... 65
         3.2.1 Overview ......................................... 65
         3.2.2 Statistical Theory for Weighted Least Squares .... 67
         3.2.3 An Example of a Strictly Linear Model ............ 74
         3.2.4 Case Record Data ................................. 79
     3.3 Overview of Repeated Measurement Analyses .............. 83
     3.4 Repeated Measurements Analysis for Categorical Data .... 89
     3.5 Missing Data Strategies for Categorical Data Analysis .. 93
         3.5.1 Ratio Estimation ................................. 94
         3.5.2 Supplemental Margins ............................. 97
     3.6 Summary ................................................ 104

IV.  ANALYSIS OF COMPLETE DATA: VARIABLE SELECTION .............. 106

     4.1 Introduction ........................................... 106
     4.2 Variable Selection for Categorical Data ................ 107
     4.3 Variable Selection Extended to Multivariate
         Response Profiles ...................................... 110
     4.4 Application of Multivariate Variable Selection
         to CHESS Data for 2+ Colds and 1+ Asthma in
         1973, 1974, and 1975 ................................... 112

V.   LINEAR MODELS ANALYSIS OF COMPLETE DATA .................... 122

     5.1 Linear Models Analysis of 1+ Asthma Data ............... 122
     5.2 Linear Models Analysis of 2+ Colds for 1973,
         1974, and 1975 for the Sex x Race x Area
         Cross-Classification ................................... 135
     5.3 Linear Models Analysis for 1+ Asthma for the
         Area x Sex Cross-Classification for 1973,
         1974, and 1975 Combined ................................ 155
     5.4 Linear Models Analysis of Mean Colds for 1973,
         1974, and 1975 ......................................... 161

VI.  ANALYSIS OF INCOMPLETE DATA ................................ 174

     6.1 Introduction ........................................... 174
     6.2 Univariate Analysis of 1+ Asthma in 1973, 1974
         and 1975 ............................................... 175
     6.3 Supplemental Margins in the Analysis of the
         Proportions of Colds Reported in 1973, 1974,
         and 1975 ............................................... 185
         6.3.2 Supplemental Margins as a Means of
               Extending the Complete Sample Size ............... 202
     6.4 Multivariate Ratio Estimation for the Analysis
         of Mean Colds for the Six Point Data ................... 211
     6.5 Analysis of the Proportion of Those Reporting
         Asthma in 1973, 1974 and 1975 with
         Multivariate Ratio Estimation .......................... 232

VII. DISCUSSION ................................................. 244

BIBLIOGRAPHY .................................................... 248

CHAPTER I

REVIEW OF STUDIES CONCERNING AIR POLLUTION AND HEALTH

AND DISCUSSION OF CHESS PULMONARY FUNCTION DATA

1.1 Overview of Air Pollution Studies

Air pollution has been a major concern of

industrialized countries for many decades. Besides being

credited with damaging the environment and the ecosystem,

air pollution has also been blamed for adversely affecting

human health. Such concern was part of the motivation for

the establishment of the U.S. Environmental Protection

Agency as well as the passage of a [illegible] Clean Air Act to set

permissible concentrations of [illegible] pollutants. In the

past fifteen years, over [illegible] studies have been conducted

to estimate health costs incurred due to air pollution. The

resulting figures range from a few hundred million to ten

billion dollars per year (Herman 1977).

However, scientific documentation of the harmful

consequences of air pollution has been difficult to obtain.

In addition, the exact mechanisms by which it might cause

damage are still being investigated. Historical events

which seemed to create a fairly strong case against air

pollution were acute episodes in which weather-induced air

stagnation served to increase markedly the concentrations of

pollutants. These were primarily sulfur oxide and

particulate complex resulting from the burning of coal.

Excessive mortality due to incidents of this nature occurred

in Meuse Valley, Belgium in 1930, Donora, Pennsylvania in

1948 and New York City in 1953 (Shy et al. 1978). However,

the most severe example is the London fog of December, 1952,

in which up to four thousand deaths were attributed to the

unusual concentrations of air pollutants. These were mainly

the result of bronchitis, pneumonia, and other respiratory

and cardiac diseases. These episodes served to focus

attention on the potentially harmful consequences of air

pollution. Soon, some types of controls were established in

several countries. Also, systematic investigations into the

cause and effect of air pollution were initiated.

Besides the sulfur oxide and particulate complex,

there are two other major types of air pollution which are

recognized: photochemical oxidants, and miscellaneous,

which are produced at point sources such as smelters, mines,

and factories. The sulfur oxide/particulate complex is

believed to have the deleterious health effect of increasing

the risk of acute and chronic respiratory disease and

aggravating chronic lung disease. The photochemical

oxidants can cause eye irritation and respiratory symptoms,

including coughing and choking. Exposures to asbestos can

lead to malignancies of the lung, while high levels of

mercury can result in central nervous system trouble and

renal toxicity.

Various studies have been conducted to investigate the

connection between air pollution and daily mortality (Martin

1964; Buechley and Riggan 1973). One example is a study led

by McCarroll and Bradley (1966) which examined daily deaths

in New York City from 1896-1965. They found episodes of

unusually high mortality which corresponded to days of high

pollution. Low wind speed and temperature inversion were

also recorded for those periods. They studied five of these

episodes closely and found that a rise in mortality occurred

on the peak days of the pollution and also that all ages

were affected.

Most of these types of studies found greater mortality

on those days with higher-than-usual pollution. However,

confounding factors in this relationship may include weather

conditions, season, and flu epidemics. In his study in New

York City from 1962-1966, Buechley (1973) attempted to

control for such extenuating factors. He looked at daily

reported deaths from all causes, and adjusted their counts

for seasonal cycles, temperature extremes, holidays,

weekdays, and influenza epidemics. SO2 was still found to

be correlated with daily mortality after these adjustments,

as were the particulate pollutants.

Researchers have also directed attention to the

relationship between air pollution and morbidity. Those

studies examining a potential association between chronic

respiratory disease and air pollution also noted problems

with potentially confounding factors such as smoking, health

conditions, and exposures at the workplace. Many studies

have been undertaken, including some in the United States,

Britain, and Canada, and most have indicated a positive

association between chronic respiratory symptoms and

pollution -- specifically sulfur oxide and particulates

(Lambert and Reid 1970; Bates 1967; Chapman et al. 1973).

Lambert and Reid did a mail survey of 9975 persons in Great

Britain and found that prevalence rates for symptoms

increased with increasing air pollution, and that cigarette

smokers had higher rates than non-smokers. However, there

are so many potential confounders that it's difficult to

assess the effect air pollution had by itself. Occupational

exposures, socioeconomic factors, selective migration, and

smoking behavior tend to cloud the issue and the progressive

nature of chronic respiratory disease makes it difficult to

evaluate the effect of air pollution on the disease.

The incidence of acute respiratory diseases has also

been observed to be higher in areas with high levels of

pollution. Dohan and Taylor (1960) studied female RCA

employees in several U.S. cities during 1957-1960. They

looked at respiratory illness lasting seven or more days,

and found it correlated with sulfation rates. This study

did not adjust for season, which can greatly affect the

occurrence of respiratory disease. Later studies did adjust

for season and temperature, as well as for social class.

White collar workers at a New York City insurance company in

1965-1967 were found to have higher daily respiratory

disease absences during periods of higher SO2 and

particulate concentrations, even after adjusting for season

(Verma et al. 1969). Levy (1977) found that hospital

admissions for respiratory disease in Hamilton, Ontario were

increased on days of heavy pollution, also adjusting for

season.

Another respiratory disease that has been studied with

respect to air pollution is bronchial asthma. Schrenk et

al. (1949) found that eighty-eight percent of asthmatics

living in Donora, Pennsylvania during the 1948 episode

reported having symptoms during that period. Some studies

have focused on the incidence of asthmatic attacks during

periods of acute air pollution, when it is likely asthmatics

would become one of the more susceptible groups. Emergency

room visits for asthma increased at [illegible] of seven New York

City hospitals studied during a [illegible] air pollution episode

(Glasser et al. 1976). Some studies did not find an

association between emergency room visits and air pollution

levels (Rao et al. 1973; Sultz et al. 1970). Still other

studies have found a link between asthmatic attacks and

level of pollution in the community (Yoshida 1976).

The effect of air pollution on the occurrence of

respiratory illness in children has also been the subject of

investigation. Their high susceptibility to respiratory

illness would appear to make them choice candidates for

early victims of environmentally-induced respiratory

problems. Also, with this study population, the role of

smoking behavior in the overall picture is no longer a

factor to be controlled, at least for very young children.

As the authors of "Health Effects of Air Pollution" state:

"... an impressive number of studies has consistently demonstrated an association between acute respiratory illness rates in children, particularly illnesses of the lower respiratory tract, with residence in more polluted communities affected by the sulfur oxide/particulate category." (Shy et al. 1978)

One major study was conducted in Great Britain,

beginning in 1946. More than 3000 children born in Great

Britain during the first week of March of that year were

followed and evaluated periodically as part of a

longitudinal health survey effort. Subjects were grouped

into one of four pollution categories, mostly depending on

coal consumption in the region. A combination of mothers'

reports, doctors' records and health examinations were used

to gain information about each child's respiratory history.

There was a definite association between lower respiratory

symptoms and pollution category; in fact, there was a

gradient in level of reported disease from the lowest

pollution area to the highest. There was not a similar

association for upper respiratory disease symptoms (Douglas

and Waller 1966).

Another study conducted in England in 1964 concerned a

group of 819 five year olds in one of four areas of

Sheffield selected for their varying degrees of air

pollution. This was based on smoke/sulfur dioxide

gradients. Researchers found that there was an association

between air pollution level and incidence of both upper

respiratory symptoms and lower respiratory symptoms. Forced

expiratory volume (FEV) and forced vital capacity (FVC) were

measured as well, but were not significantly different from

one area to another (Lunn et al. 1967).

Several other investigators have used schools as a

base for their study of children. Toyama examined peak flow

rates for schoolchildren living in Osaka and Kawasaki,

Japan, both heavily polluted from industrial sources. He

found that those children living in the more polluted areas

had lower peak flows than those living in the less polluted

areas. Ferris studied pulmonary function as well as school

absences in first and second graders in Berlin, N.H., from

January 1966 to June 1967. Pollution from paper mills is a

major environmental problem here. While school absences

were not significantly different for those in the more

polluted neighborhoods, the peak flow rates were lowest for

those from the more heavily polluted areas (Ferris 1970).

Shy et al. (1973) found some differences in ventilatory

function for certain race/age groups of children living in

the more polluted areas of Cincinnati, Chattanooga, and New

York City. However, the differences were usually fairly

small. Researchers at Akron, Ohio monitored air pollution

levels at two elementary schools and conducted pulmonary

function tests once children were discovered to be

symptomatic of acute respiratory disease. They concluded

that those children at the school with higher SO2 and NO2

pollution had higher incidences of disease; also, their

pulmonary function was further decreased from that of the

other school.

1.2 CHESS Studies

In 1967 EPA organized what was to be a multi-million

dollar effort to assess air pollution's effect on health in

the United States called the Community Health and

Environmental Surveillance System (CHESS). The system

actually encompassed many different studies, some

retrospective and some prospective. The common thread was

to involve communities or areas with exposure gradients for

a particular pollutant; pollution associated with sulfur

oxides, particulates and oxidants was under investigation.

In addition, factors such as age, race, sex, and socioeconomic

status were to be controlled, either through the

use of homogeneous communities or as part of a later

statistical analysis. Such populations as the elderly,

asthmatic, and children were to be investigated, and health

indicators such as disease occurrence and pulmonary function

used. In 1972, over 250,000 people were involved in CHESS

studies (Report to the U.S. House of Representatives

Committee on Science and Technology 1980).

However, when the first set of CHESS data was

collected and published in 1974, the program became very

controversial. There were problems with data quality,

particularly the air monitoring data, and also concern with

some of the health questionnaires used. A Congressional

investigation ensued and concluded that the data published

could not be used to support its estimates for the specific

levels of pollution which were associated with serious

health effects. As a result, other CHESS datasets were

subjected to intense data validation efforts which were not

completed until 1978. In addition, the Congressional report

detailed the limitations of certain aspects of the CHESS

data.

1.3 Children's Pulmonary Function Data

One of the CHESS studies was directed at assessing the

relationship between air pollution and pulmonary function in

children. From 1972-1975, pulmonary function, as measured

by forced expiratory volume at .75 seconds (FEV.75), was

evaluated in the fall, winter, and spring for over 20,000

elementary school children. Areas were selected according

to a criterion of expected pollution gradients, and included

Charlotte, N.C., Birmingham, Alabama, New York, the Salt

Lake Basin in Utah, and two separate areas in the Los

Angeles Basin in California. Within these areas,

communities, referred to as sectors, were chosen for study

for being similar to each other in terms of demographics but

varying with respect to degree of pollution exposure. Each

area included from two to six of these sectors, which were

basically white and middle-class. The study included

children who were in the second, third, or fourth grade in

Fall, 1972. Measurements were taken nine times; these were

Fall 1972, Winter 1973, Spring 1973, Fall 1973, Winter 1974,

Spring 1974, Fall 1974, Winter 1975, and Spring 1975.

Besides FEV.75, other information gathered for each subject

included school, grade, birth date, race, sex, height, and a

self-reported respiratory symptom which indicated whether or

not the subject had a cold and/or asthma at the time.

Pollutants that were monitored included total suspended

particulate matter, suspended sulfates, and sulfur dioxide.

In California, ozone levels were also monitored. Quarterly

geometric means were calculated from daily measurements of

the monitoring stations. On the whole, these were situated

such that most of the subjects' residences were within two

miles of the stations, which were sometimes located at the

schools themselves. There was a great deal of trouble with

the aerometric data. Procedural errors resulted in a

negative bias in the total sulfate particulate levels of

from ten to thirty percent. Similar problems with suspended

sulfate measurements left the first two years of these data

with as much as a 50% negative bias. Other methodological

and shipping errors led to data quality problems with sulfur

dioxide as well.
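As a rough illustration of the quarterly geometric means mentioned above, the following Python sketch groups daily readings by quarter and averages them on the log scale. The function names, the (date, value) input layout, and the use of calendar quarters are assumptions made for illustration; the text does not say how the CHESS quarters were delimited or how non-positive readings were treated.

```python
import math
from collections import defaultdict
from datetime import date


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed on the log scale."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, and all values must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def quarterly_geometric_means(readings):
    """Map (year, quarter) to the geometric mean of that quarter's daily readings.

    readings: iterable of (datetime.date, value) pairs.
    Calendar quarters are an assumption, not something the source specifies.
    """
    by_quarter = defaultdict(list)
    for day, value in readings:
        quarter = (day.month - 1) // 3 + 1
        by_quarter[(day.year, quarter)].append(value)
    return {key: geometric_mean(vals) for key, vals in by_quarter.items()}
```

A geometric mean is less sensitive than an arithmetic mean to a few very high daily readings, which is one common reason for using it with pollutant concentrations.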

Researchers at the University of North Carolina/Chapel

Hill, under the direction of Carl Shy, entered into a

contract with EPA to clean and edit this particular dataset

and then perform a statistical analysis. Various stages of

data processing were required to produce a database of

sufficient quality to analyze. The EPA raw data consisted

of one record per subject for a year's data - i.e. fall,

winter, and spring measurements. A total of 60,836 such

records were available. There were many problems with these

data, including missing records due to absences or

migration, invalid date values, and difficulties in matching

records from one year to another. Data editing

was performed by Keith Muller and Joanna Smith, the end

result of which was an analysis file which contained

complete records, i.e. nine time points, for 3,666 subjects.

This involved matching 18,714 valid first year records,

20,980 second year records, and 21,142 third year records by

area, sector, school, and name. If perfect name matches

were not made, names were transformed to a Soundex-like

representation and matched to the third year records (Muller

et al. 1981). Sex, race, birth month and birth year were

required to match across all three years, with the possible

exception of one mismatch out of the twelve possible. Those

records missing values for FEV.75 were deleted, as well as

those corresponding to subjects reporting asthma at any of

the nine time points.
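The matching rules just described can be sketched in Python as follows. This is an illustration only: the text says the names were reduced to a "Soundex-like" representation without giving the exact coding, so standard American Soundex is substituted here, and the way mismatches are counted (values that differ from the most common value of each field) is likewise an assumption. All function and field names are hypothetical.

```python
from collections import Counter

# Standard American Soundex digit groups (a stand-in for the study's coding).
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name):
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated digit is kept after them
    return (code + "000")[:4]


def demographic_mismatches(yearly_values):
    """Count values that disagree with the most common value of each field.

    yearly_values: dict mapping a field name to its three yearly values,
    e.g. sex, race, birth month and birth year (4 fields x 3 years = 12 values).
    """
    total = 0
    for values in yearly_values.values():
        most_common_count = Counter(values).most_common(1)[0][1]
        total += len(values) - most_common_count
    return total


def records_match(yearly_values, allowed=1):
    """Accept a linked record if at most `allowed` of the values disagree."""
    return demographic_mismatches(yearly_values) <= allowed
```

For example, `soundex("Smith")` and `soundex("Smyth")` give the same code, so the two spellings would be treated as candidate matches before the demographic check is applied.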

1.4 Pulmonary Function Data Analysis

The focus of the original analysis of these data was

to assess whether there was a relationship between pulmonary

function and air pollution in children. A multivariate

analysis of variance was used in order to investigate this

relationship. The dependent variables modeled were the nine

FEV.75 measurements for fall, winter, and spring for all

three years. The variable used to account for air pollution

exposure was an indicator variable for sector of residence.

This was considered appropriate since the sectors were

chosen according to an expected pollution gradient implied

by historical data. The pollution data collected showed an

observed gradient that, on the whole, came close to the

expected gradient (Hasselblad et al. 1974). The data was

split into 10% and 90% samples via a stratified random

procedure. Regressions were performed on a set of

demographic variables in the 10% sample to determine

relevant covariates. Those chosen were race, sex, height,

height squared, age, and age squared (Muller et al. 1981).
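A stratified random split of the kind described above might look like the following Python sketch. The stratification variables used in the original analysis are not stated here, so the `stratum_of` callback, the seed, and the rounding rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, stratum_of, fraction=0.10, seed=0):
    """Split records into a small and a large sample within each stratum.

    records:    list of arbitrary record objects
    stratum_of: function mapping a record to its stratum label (hypothetical)
    fraction:   share of each stratum placed in the small sample
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)

    small, large = [], []
    for members in by_stratum.values():
        members = members[:]          # copy so the caller's list is not reordered
        rng.shuffle(members)
        k = round(fraction * len(members))
        small.extend(members[:k])     # exploratory sample (covariate selection)
        large.extend(members[k:])     # held-out sample (main analysis)
    return small, large
```

Choosing covariates on the small sample and fitting the main model on the large one keeps the variable selection from influencing the tests carried out on the larger sample.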

Seven analyses were performed altogether. Six were

area specific, in which case the important predictor was

sector, and the other design factors were year and season.

The seventh analysis included all the areas, and area

replaced sector as a factor in its design. The basic

findings of the all-areas analysis involved interactions

with area. For the within Birmingham, within Utah, and

within New York comparisons, significant relationships were

found between FEV.75 and sector. In Charlotte and

California II, no significant relationship was found, but in

these areas the expected pollution gradient was not observed

either. For California I, one did find the expected

pollution gradient but no relationship between sector and

pulmonary function was evident. The authors concluded that

these analyses, taken together, supported a relationship

between pollution and patterns of pulmonary function in

children (Muller et al. 1981).

1.5 Data Structures for Categorical Data Analysis

The motivation for this dissertation is the analysis

of the categorical response measures in the CHESS dataset.

As mentioned above, a self-reported measure is included

which indicates whether or not a subject had a cold and/or

asthma at the time of the pulmonary function test. In

addition, this dissertation will deal with all the data

vectors collected, both complete and incomplete. Thus,

additional data management and the construction of new data

structures were required. The data profiles fall into one

of seven groups -- those corresponding to subjects with data

for each of the years 1973, 1974, and 1975, and six other

groups with data for various combinations of those years as

illustrated below:

1. 1973, 1974, and 1975
2. 1973 and 1974
3. 1973 and 1975
4. 1974 and 1975
5. 1973 only
6. 1974 only
7. 1975 only

For each year represented in one of these groups, there is

data for three time points corresponding to a fall, winter,

and spring measurement. Thus, there are three basic types

of profiles represented in this dataset. The first is the

complete data profile, consisting of nine time points per

observation. The second consists of six data points and

three missing, of which there are three kinds corresponding

to either 1973, 1974, or 1975 being missing. The third

profile is that which consists of only three data points,

with six missing. These arise when there is data for one

year only, so there are also three ways of producing this

particular profile.
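The seven groups and three profile types described above follow directly from the non-empty subsets of the three study years. A minimal Python sketch, with names chosen here purely for illustration:

```python
from itertools import combinations

YEARS = (1973, 1974, 1975)
SEASONS = ("fall", "winter", "spring")


def year_profiles(years=YEARS):
    """All non-empty combinations of study years, largest first."""
    groups = []
    for k in range(len(years), 0, -1):
        groups.extend(combinations(years, k))
    return groups


def time_points(profile):
    """Number of measurement occasions observed for a profile."""
    return len(profile) * len(SEASONS)


for profile in year_profiles():
    missing = time_points(YEARS) - time_points(profile)
    print(profile, "observed:", time_points(profile), "missing:", missing)
```

Each year contributes three time points, so a profile has nine, six, or three observed occasions depending on whether it covers three, two, or one of the years.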

The data management required to produce a dataset

containing these profiles consisted of going back to an

intermediate stage in Keith Muller's data management in

which records without complete data were left behind, and

merging them with a file which included the 'complete data'.

Edits done included deleting those records which were

missing key demographic information. For both the doubles

(those having data from two years) and singles (data from

only one year), observations were deleted if sex, race,

birthmonth or birthyear were missing at any of the six or

three time points. If there was a disagreement on a

demographic variable value for a doubles observation, it was

deleted. In order to keep the data consistent with that in

the original analysis file, records missing FEV.75 were also

eliminated. It was not felt that any data of consequence

would be lost due to this last action, since FEV.75 was the

focal point of a measurement period, and, if missing, tends

to throw into doubt the validity of other data registered

for that period.

The above editing process led to a dataset consisting

of 20,392 records, distributed as follows:

          Data Profile
1973     1974     1975     Number of Observations

yes      no       no        4165
no       yes      no        3426
no       no       yes       4979
yes      yes      no        1208
yes      no       yes        434
no       yes      yes       2131
yes      yes      yes       4049

The most striking impression one gets from this table is the

five-fold increase in the number of observations from

approximately four thousand to twenty thousand when the

incomplete data is included. The relatively low number of

observations for the 1973 and 1975 only profile is

understandable. One potential explanation is migration, with

the odds of a family returning a year later seemingly

limited.
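The arithmetic behind these remarks can be checked with a short Python sketch. The counts are those of the profile table; the variable and function names are illustrative only.

```python
# Keys are (has_1973, has_1974, has_1975); counts come from the profile table.
PROFILE_COUNTS = {
    (True, False, False): 4165,
    (False, True, False): 3426,
    (False, False, True): 4979,
    (True, True, False): 1208,
    (True, False, True): 434,
    (False, True, True): 2131,
    (True, True, True): 4049,
}


def count_by_years_present(profile_counts):
    """Total observations with data for one, two, or three years."""
    totals = {1: 0, 2: 0, 3: 0}
    for profile, n in profile_counts.items():
        totals[sum(profile)] += n   # True counts as 1
    return totals


total_records = sum(PROFILE_COUNTS.values())
by_years = count_by_years_present(PROFILE_COUNTS)
increase = total_records / by_years[3]

print("total records:", total_records)
print("by number of years present:", by_years)
print("ratio of all records to three-year records:", round(increase, 2))
```

The seven counts add up to the stated 20,392 records, and the ratio of all records to three-year records is about 5, which is the five-fold increase noted in the text.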

Two main types of data structures were created for

initiating statistical analysis. The first consisted of

data for individual years, i.e. all those observations which

had data for 1973 went into the 1973 dataset, regardless of

which profile they represented. Similarly, datasets were

constructed for 1974 only and 1975 only. 9806 records were

included in the 1973 dataset, 10762 records in the 1974

dataset, and 11560 records in the 1975 dataset. The

categorical response variables had to be available for all

three time points in a year in order that an observation

qualify for a particular 'year' dataset (136 observations

did not meet this criterion and were excluded). It should

be noted that an individual could be represented in one,

two, or three of the 'year' datasets. The other data

structure of primary interest is that of the complete data,

those which had categorical data for all nine time points.

There were 4002 of these observations. The reason that this

complete dataset contains more data than the analysis

dataset discussed in section 1.4 is that the latter excluded

asthmatics. Also, it should be noted that the 4049 records

listed in the profile table above had

data for each year, but not necessarily for each of the

three time points within each year. So, 47 records were

classified as having a three-year profile from the point of

view of the individual year datasets, but were not

acceptable into the 'complete' dataset.

1.6 Data Description

Tables 1.1-1.4 contain the demographic

crosstabulations for the 1973, 1974, 1975, and Complete

datasets respectively. Total numbers of White, Black,

Hispanic, and Other races are displayed within sex and

geographic area. Females have a slight advantage over males

in the Complete data at 51%, while this is reversed for the

individual datasets, as the percent of males is

approximately 51% for them. Birmingham is the area with the

most subjects and New York City has the fewest. The

Complete data is 79% White, 15% Black and 3% Hispanic with

the remaining 1% classified as Other. Race percentages for

the individual year datasets follow the same pattern

closely, with White ranging from 75%-79%, Black ranging from

15%-19%, Hispanic ranging from 3%-4% and Other at 1%-2%. It

should be noted that several cells of this area by sex by

race demographic crosstabulation were not represented in the

Complete dataset.

Tables 1.5-1.8 contain information related to the

classification of each area by a pollution index. The index

was created from the aerometric data collected as part of

the study and displayed in Muller et al. (1981). Average

sector ranks were calculated for Total Suspended

Particulates for 1972-1975, Total Suspended Sulfates for

1972-1975, and Sulfur Dioxide for 1972-1975. Scores of one,

two, and three, were assigned to the sectors for each of the

three types of measures, where '1' was assigned to the

highest one-third rankings, '2' was assigned to the middle

third rankings, and '3' was assigned to the lowest third.

These scores were then added, resulting in total scores

ranging from 3 to 9. The following pollution index was then

created: 1 for totals of 7, 8, or 9; 2 for totals of 5 or

6; and 3 for totals of 3 or 4. Thus, '1' corresponds to

lower pollution, '2' corresponds to medium level pollution,

and '3' corresponds to higher pollution

levels. Of course, these labels are relative to the CHESS

data, but they do provide a pollution gradient. There were

nine sectors classified as low pollution, mostly in

Birmingham. Six sectors were assigned to the medium

pollution group, and the remaining seven sectors were

classified as high pollution. Tables 1.5-1.8 display the

number of subjects which comprise the cells of an area by

sex by pollution index crossclassification for the 1973,

1974, 1975, and Complete data. Note that there will be

cells with no elements since not every area will have

sectors representing each of the pollution levels.
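
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the program used in the study; the function name and argument names are hypothetical, and the three arguments are assumed to be the per-sector scores (1, 2, or 3) for Total Suspended Particulates, Total Suspended Sulfates, and Sulfur Dioxide.

```python
def pollution_index(tsp_score: int, sulfate_score: int, so2_score: int) -> int:
    """Map three per-pollutant scores (each 1, 2, or 3) to the three-level index.

    Hypothetical helper: names are illustrative and do not come from the study.
    """
    scores = (tsp_score, sulfate_score, so2_score)
    if any(s not in (1, 2, 3) for s in scores):
        raise ValueError("each score must be 1, 2, or 3")
    total = sum(scores)  # total score ranges from 3 to 9
    if total >= 7:       # totals 7, 8, 9
        return 1         # lower pollution
    if total >= 5:       # totals 5, 6
        return 2         # medium pollution
    return 3             # totals 3, 4: higher pollution


# A sector scored 3 on all three measures falls in the lower pollution group.
index_for_clean_sector = pollution_index(3, 3, 3)
```

Because the index depends only on the total, two sectors with different score patterns but the same sum (for example 3, 2, 2 and 3, 3, 1) receive the same index.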

The outcome variables of interest depend on the data

structures which are being analyzed. For the 1973, 1974,

and 1975 datasets, a variable was created which indicated

the number of times a subject reported having a cold that

year. Thus, the possible values are 0, 1, 2, and 3 since there

were three measurement periods. A similar variable was

created to indicate the number of times asthma was reported

during the course of the year. Tables 1.9-1.11 display the

mean number of colds reported by area, sex, and race for

1973, 1974, and 1975. Charlotte has the highest mean colds

in 1973 and 1974 with mean colds of .94 and .93, while Utah

takes the lead in 1975 with a mean number of colds of .94.

NYC has the fewest colds per year for all three years. Its

lowest value is .62 colds per year for 1974. The only sort

of trend one might discern by looking at these tables is

that females consistently report more colds than males.
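
As a minimal sketch of how such a yearly outcome variable and its mean could be formed, assuming each subject-year record holds three 0/1 indicators (one per measurement period), consider the following. The function name and the example records are hypothetical and are not taken from the CHESS data.

```python
from statistics import mean


def colds_in_year(period_reports):
    """Count the measurement periods (out of three) in which a cold was reported."""
    if len(period_reports) != 3:
        raise ValueError("a 'year' record needs exactly three measurement periods")
    if any(r not in (0, 1) for r in period_reports):
        raise ValueError("each report must be 0 (no cold) or 1 (cold)")
    return sum(period_reports)  # possible values: 0, 1, 2, 3


# Hypothetical records for three subjects in one year (illustrative only).
records = [(1, 0, 1), (0, 0, 0), (1, 1, 1)]
counts = [colds_in_year(r) for r in records]
mean_colds = mean(counts)  # the kind of cell mean shown in the mean-colds tables
```

The same construction applies to the asthma variable, with the indicators recording whether asthma was reported at each period.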

Table 1.12 is concerned with the proportion of children

reporting colds during each of the measurement periods for

the Complete data. Proportions are listed for males and

females within each of the six areas. Charlotte has the

highest proportions of colds in general, and females have

more colds than males at each of the nine time points.

Tables 1.13-1.15 display the mean number of colds

reported by area, sex, and pollution index for 1973, 1974,

and 1975 data. The mean is lowest for the higher pollution

category in all three years, which is interesting. The

values span a narrow range, from a low of .73 for higher

pollution in 1973 to a high of .83 for lower pollution in

1974. Again, females consistently have a

higher number of colds than males. Finally, Table 1.16

displays the proportion of colds at each measurement period

by sex and pollution index for the Complete data. Females

are still reporting more colds than males within each of the

pollution categories.

1.7 Overview of Research

The overall purpose of this dissertation will be to

evaluate the usefulness of some recently developed

categorical data methodology in the analysis of a very large

dataset pertaining to an environmental health problem. One

direction of analysis is the investigation of the

association between categorical health status measures and

extent of air pollution, in order to complement the work

that has already been accomplished for the continuous

measures of the data. Health status is measured by

variables indicating whether or not there is a cold or

asthma recorded at one of nine time points in the study, and

pollution level is indicated by a three-point scale

calculated from the aerometric data collected as part of the

study. One type of the methodology under study will be

randomization techniques which under minimal assumptions

allow one to assess the strength of the relationship of a

response measure and evaluation measure while controlling

for the effects of confounding variables. Other methods to

be employed will be extensions of the weighted least squares

regression methodology outlined in Grizzle, Starmer, and Koch

(1969) which are appropriate for the repeated measures and

incomplete data aspects of the study. Some of these

strategies have only been applied to simplified illustrative

examples, so it is of interest to assess their

appropriateness when applied to a large health dataset.

The first stage of analysis will concentrate on

assessing the relationship between the response variable

(e.g. number of colds) and the evaluation variable (e.g.

pollution) through the use of first order association

statistics. Mantel-Haenszel strategies will be employed to

investigate whether these first order associations (if they

exist) are maintained after adjusting for potentially

confounding factors such as geographic area, sex, race, and

age. The second stage of this part of the analysis will be

to use weighted least squares techniques to model the

relationship of health status and pollution level across the

demographic configurations which display variation. This

relationship will also be modeled across time for those

subjects with complete data.
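
To make the idea of a stratum-adjusted association statistic concrete, the following Python sketch computes the Mantel-Haenszel chi-square (one degree of freedom, no continuity correction) for a set of 2 x 2 tables, one per stratum. It is a simplified illustration under stated assumptions: the analyses planned here involve responses and factors with more than two levels, which require the extended Mantel-Haenszel statistics, and the function name and example counts are hypothetical.

```python
def mantel_haenszel_chi2(tables):
    """Mantel-Haenszel chi-square for stratified 2 x 2 tables.

    Each table is ((a, b), (c, d)); rows might be two pollution groups and
    columns cold reported / not reported, with one table per stratum
    (for example, one per area-by-sex combination).
    """
    observed = expected = variance = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue  # the hypergeometric variance is undefined for n < 2
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("no stratum contributes information")
    return (observed - expected) ** 2 / variance


# Hypothetical counts for a single stratum (illustrative only).
statistic = mantel_haenszel_chi2([((10, 5), (5, 10))])
```

Under the hypothesis of no association within any stratum, the statistic is referred to a chi-square distribution with one degree of freedom; summing observed, expected, and variance terms across strata is what provides the adjustment for the stratifying variables.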

Another objective of this dissertation will be to

address the analysis of the partially complete data vectors

in this dataset. One strategy of interest is that of

multivariate ratio analysis (Stanish, Koch, and Landis 1978).

This involves the calculation of multivariate ratio

estimates of the means and a corresponding covariance matrix

estimate. For the data under study, this might consist of

mean colds per year. The variation among these means could

thus be analysed using asymptotic regression methodology.

Another method of interest is that of supplemental margins

(Koch, Imrey, and Reinfurt 1972). This is a linear models

technique in which the complete and incomplete data are

combined by considering the incomplete observations as

members of distinct subpopulations. These subpopulations

then contribute whatever information they contain to the

marginal proportions that are subsequently formed from the

data as a whole and analysed.

In conclusion, this dissertation will be concerned

with several aspects of the analysis of a large longitudinal

dataset. One objective will be to determine if the

integrated set of relatively new categorical analysis

procedures is an appropriate resource with which to answer

substantive questions about the relationship between health

status and air pollution which the study sought to answer.

Another aspect will be the application of categorical

analysis strategies to partially complete data and the

evaluation of their usefulness, particularly in the context

of an especially large dataset. Finally, the extent to

which modifications to such procedures would make them more

helpful will also be evaluated.

TABLE 1.1
AREA BY RACE BY SEX FOR 1973 DATA

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
           | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
AREA       |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |    442 |    202 |      2 |      2 |    462 |    209 |   NONE |      1 |    648 |    672 |   1320 |
BIRMINGHAM |   1060 |    488 |      2 |      3 |   1003 |    445 |      1 |      4 |   1553 |   1453 |   3006 |
NYC        |    238 |     55 |      2 |      2 |    230 |     62 |      8 |      4 |    297 |    304 |    601 |
UTAH       |    727 |      4 |     52 |     18 |    711 |      2 |     79 |     19 |    801 |    811 |   1612 |
CALI I     |    855 |     20 |     76 |     39 |    787 |     20 |     77 |     33 |    990 |    917 |   1907 |
CALI II    |    673 |      1 |     40 |     16 |    565 |      3 |     50 |     12 |    730 |    630 |   1360 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   3995 |    770 |    174 |     80 |   3758 |    741 |    215 |     73 |   5019 |   4787 |   9806 |

TABLE 1.2
AREA BY RACE BY SEX FOR 1974 DATA

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
           | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
AREA       |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |    420 |    208 |      4 |      1 |    423 |    219 |      1 |      3 |    633 |    646 |   1279 |
BIRMINGHAM |   1311 |    643 |   NONE |      4 |   1238 |    625 |   NONE |      6 |   1958 |   1869 |   3827 |
NYC        |    391 |    129 |     18 |      9 |    342 |    138 |     14 |      5 |    547 |    499 |   1046 |
UTAH       |    712 |      5 |     58 |     16 |    679 |      3 |     71 |     16 |    791 |    769 |   1560 |
CALI I     |    811 |     18 |     74 |     37 |    731 |     20 |     76 |     40 |    940 |    867 |   1807 |
CALI II    |    610 |      1 |     45 |     18 |    509 |      3 |     42 |     15 |    674 |    569 |   1243 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   4255 |   1004 |    199 |     85 |   3922 |   1008 |    204 |     85 |   5543 |   5219 |  10762 |


TABLE 1.3
AREA BY RACE BY SEX FOR 1975 DATA

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
           | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
AREA       |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |    567 |    286 |      4 |      1 |    556 |    307 |      1 |      6 |    858 |    870 |   1728 |
BIRMINGHAM |   1328 |    674 |      1 |      1 |   1273 |    667 |      1 |      9 |   2004 |   1950 |   3954 |
NYC        |    373 |    113 |     15 |      9 |    321 |    122 |     12 |     10 |    510 |    465 |    975 |
UTAH       |    717 |      1 |     59 |     19 |    673 |      3 |     70 |     14 |    796 |    760 |   1556 |
CALI I     |    902 |     26 |     81 |     52 |    827 |     18 |     76 |     49 |   1061 |    970 |   2031 |
CALI II    |    647 |      2 |     46 |     23 |    540 |      3 |     41 |     14 |    718 |    598 |   1316 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   4534 |   1102 |    206 |    105 |   4190 |   1120 |    201 |    102 |   5947 |   5613 |  11560 |

TABLE 1.4
AREA BY RACE BY SEX FOR COMPLETE DATA

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
           | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
AREA       |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |    136 |     60 |      2 |      1 |    181 |     67 |   NONE |   NONE |    199 |    248 |    447 |
BIRMINGHAM |    303 |    213 |   NONE |   NONE |    377 |    212 |   NONE |   NONE |    516 |    589 |   1105 |
NYC        |     63 |     14 |   NONE |      1 |     62 |     12 |      2 |      2 |     78 |     78 |    156 |
UTAH       |    320 |   NONE |     26 |      3 |    350 |   NONE |     33 |      3 |    349 |    386 |    735 |
CALI I     |    451 |     17 |     40 |     14 |    393 |      8 |     38 |     18 |    522 |    457 |    979 |
CALI II    |    260 |   NONE |     19 |      9 |    260 |   NONE |     21 |     11 |    288 |    292 |    580 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   1533 |    304 |     87 |     28 |   1623 |    299 |     94 |     34 |   1952 |   2050 |   4002 |

TABLE 1.5
AREA BY POLLUTION INDEX BY SEX FOR 1973 DATA

           |      LOWER      |     MIDDLE      |     HIGHER      | LOWER  | MIDDLE | HIGHER |        |
           |  MALE  | FEMALE |  MALE  | FEMALE |  MALE  | FEMALE | TOTAL  | TOTAL  | TOTAL  | TOTAL  |
AREA       |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |    396 |    405 |    252 |    267 |   NONE |   NONE |    801 |    519 |   NONE |   1320 |
BIRMINGHAM |    976 |    962 |    577 |    491 |   NONE |   NONE |   1938 |   1068 |   NONE |   3006 |
NYC        |    108 |    116 |   NONE |   NONE |    189 |    188 |    224 |   NONE |    377 |    601 |
UTAH       |    390 |    389 |    411 |    422 |   NONE |   NONE |    779 |    833 |   NONE |   1612 |
CALI I     |    307 |    286 |    218 |    197 |    465 |    434 |    593 |    415 |    899 |   1907 |
CALI II    |    230 |    203 |   NONE |   NONE |    500 |    427 |    433 |   NONE |    927 |   1360 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   2407 |   2361 |   1458 |   1377 |   1154 |   1049 |   4768 |   2835 |   2203 |   9806 |

TABLE 1.6
AREA BY POLLUTION INDEX BY SEX FOR 1974 DATA

           |      LOWER      |     MIDDLE      |     HIGHER      | LOWER  | MIDDLE | HIGHER |        |
           |  MALE  | FEMALE |  MALE  | FEMALE |  MALE  | FEMALE | TOTAL  | TOTAL  | TOTAL  | TOTAL  |
AREA       |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |    316 |    326 |    317 |    320 |   NONE |   NONE |    642 |    637 |   NONE |   1279 |
BIRMINGHAM |   1259 |   1199 |    699 |    670 |   NONE |   NONE |   2458 |   1369 |   NONE |   3827 |
NYC        |    212 |    164 |   NONE |   NONE |    335 |    335 |    376 |   NONE |    670 |   1046 |
UTAH       |    381 |    368 |    410 |    401 |   NONE |   NONE |    749 |    811 |   NONE |   1560 |
CALI I     |    280 |    265 |    209 |    185 |    451 |    417 |    545 |    394 |    868 |   1807 |
CALI II    |    228 |    181 |   NONE |   NONE |    446 |    388 |    409 |   NONE |    834 |   1243 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   2676 |   2503 |   1635 |   1576 |   1232 |   1140 |   5179 |   3211 |   2372 |  10762 |

TABLE 1.7
AREA BY POLLUTION INDEX BY SEX FOR 1975 DATA

           |      LOWER      |     MIDDLE      |     HIGHER      | LOWER  | MIDDLE | HIGHER |        |
           |  MALE  | FEMALE |  MALE  | FEMALE |  MALE  | FEMALE | TOTAL  | TOTAL  | TOTAL  | TOTAL  |
AREA       |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |    303 |    328 |    555 |    542 |   NONE |   NONE |    631 |   1097 |   NONE |   1728 |
BIRMINGHAM |   1268 |   1224 |    736 |    726 |   NONE |   NONE |   2492 |   1462 |   NONE |   3954 |
NYC        |    210 |    169 |   NONE |   NONE |    300 |    296 |    379 |   NONE |    596 |    975 |
UTAH       |    385 |    363 |    411 |    397 |   NONE |   NONE |    748 |    808 |   NONE |   1556 |
CALI I     |    324 |    301 |    225 |    194 |    512 |    475 |    625 |    419 |    987 |   2031 |
CALI II    |    248 |    198 |   NONE |   NONE |    470 |    400 |    446 |   NONE |    870 |   1316 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   2738 |   2583 |   1927 |   1859 |   1282 |   1171 |   5321 |   3786 |   2453 |  11560 |

TABLE 1.8
AREA BY POLLUTION INDEX BY SEX FOR COMPLETE DATA

           |      LOWER      |     MIDDLE      |     HIGHER      | LOWER  | MIDDLE | HIGHER |        |
           |  MALE  | FEMALE |  MALE  | FEMALE |  MALE  | FEMALE | TOTAL  | TOTAL  | TOTAL  | TOTAL  |
AREA       |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |   N    |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |    114 |    126 |     85 |    122 |   NONE |   NONE |    240 |    207 |   NONE |    447 |
BIRMINGHAM |    336 |    432 |    180 |    157 |   NONE |   NONE |    768 |    337 |   NONE |   1105 |
NYC        |     21 |     21 |   NONE |   NONE |     57 |     57 |     42 |   NONE |    114 |    156 |
UTAH       |    188 |    208 |    161 |    178 |   NONE |   NONE |    396 |    339 |   NONE |    735 |
CALI I     |    152 |    132 |     93 |     85 |    277 |    240 |    284 |    178 |    517 |    979 |
CALI II    |    107 |    100 |   NONE |   NONE |    181 |    192 |    207 |   NONE |    373 |    580 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |    918 |   1019 |    519 |    542 |    515 |    489 |   1937 |   1061 |   1004 |   4002 |

TABLE 1.9
MEAN COLDS IN 1973 BY AREA BY SEX BY RACE

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
           | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
AREA       |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |   0.86 |   0.67 |   0.50 |   0.50 |   1.02 |   1.16 |   ---- |   1.00 |   0.80 |   1.06 |   0.94 |
BIRMINGHAM |   0.67 |   0.51 |   0.50 |   0.33 |   0.95 |   0.89 |   1.00 |   1.75 |   0.62 |   0.93 |   0.77 |
NYC        |   0.63 |   0.47 |   0.00 |   1.00 |   0.75 |   0.58 |   0.50 |   0.50 |   0.60 |   0.70 |   0.65 |
UTAH       |   0.69 |   0.75 |   0.42 |   0.44 |   1.00 |   0.50 |   0.71 |   1.11 |   0.66 |   0.98 |   0.82 |
CALI I     |   0.58 |   0.40 |   0.43 |   1.00 |   0.79 |   0.40 |   0.68 |   0.64 |   0.58 |   0.77 |   0.67 |
CALI II    |   0.72 |   0.00 |   0.55 |   0.31 |   0.93 |   0.33 |   0.72 |   0.17 |   0.70 |   0.90 |   0.79 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   0.68 |   0.55 |   0.45 |   0.70 |   0.92 |   0.93 |   0.69 |   0.74 |   0.65 |   0.91 |   0.78 |

TABLE 1.10
MEAN COLDS IN 1974 BY AREA BY SEX BY RACE

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
           | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
AREA       |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |   0.78 |   0.79 |   0.25 |   1.00 |   1.10 |   1.05 |   1.00 |   1.00 |   0.78 |   1.08 |   0.93 |
BIRMINGHAM |   0.77 |   0.61 |   ---- |   0.75 |   0.97 |   1.03 |   ---- |   0.67 |   0.72 |   0.99 |   0.85 |
NYC        |   0.57 |   0.52 |   0.50 |   0.22 |   0.74 |   0.59 |   0.93 |   0.20 |   0.55 |   0.70 |   0.62 |
UTAH       |   0.80 |   0.40 |   0.48 |   0.50 |   0.99 |   1.33 |   0.62 |   0.63 |   0.77 |   0.95 |   0.86 |
CALI I     |   0.68 |   0.67 |   0.43 |   0.89 |   0.83 |   0.40 |   0.71 |   0.45 |   0.67 |   0.79 |   0.73 |
CALI II    |   0.72 |   1.00 |   0.33 |   0.50 |   0.99 |   0.67 |   0.74 |   0.40 |   0.69 |   0.95 |   0.81 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   0.74 |   0.63 |   0.43 |   0.66 |   0.94 |   0.96 |   0.70 |   0.49 |   0.70 |   0.93 |   0.81 |

TABLE 1.11
MEAN COLDS IN 1975 BY AREA BY SEX BY RACE

TABLE 1.12
MEAN COLDS BY AREA BY SEX BY RACE FOR COMPLETE DATA

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
           | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
AREA       |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
CHARLOTTE  |   0.89 |   0.80 |   0.50 |   0.00 |   1.10 |   1.15 |   ---- |   ---- |   0.85 |   1.11 |   1.00 |
BIRMINGHAM |   0.63 |   0.37 |   ---- |   ---- |   0.92 |   0.86 |   ---- |   ---- |   0.52 |   0.90 |   0.72 |
NYC        |   0.52 |   0.36 |   ---- |   1.00 |   0.71 |   0.58 |   1.50 |   0.50 |   0.50 |   0.71 |   0.60 |
UTAH       |   0.67 |   ---- |   0.38 |   0.33 |   0.95 |   ---- |   0.73 |   0.67 |   0.64 |   0.93 |   0.79 |
CALI I     |   0.59 |   0.41 |   0.57 |   1.14 |   0.82 |   0.25 |   0.68 |   0.56 |   0.60 |   0.79 |   0.69 |
CALI II    |   0.74 |   ---- |   0.63 |   0.44 |   0.97 |   ---- |   0.86 |   0.18 |   0.72 |   0.93 |   0.83 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   0.66 |   0.45 |   0.53 |   0.79 |   0.92 |   0.90 |   0.76 |   0.44 |   0.63 |   0.90 |   0.77 |

TABLE 1.13
MEAN COLDS BY POLLUTION INDEX BY SEX BY RACE FOR 1973 DATA

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
POLLUTION  | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
INDEX      |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
1          |   0.67 |   0.55 |   0.48 |   0.73 |   0.91 |   0.86 |   0.75 |   0.85 |   0.65 |   0.89 |   0.77 |
2          |   0.73 |   0.55 |   0.26 |   0.65 |   1.04 |   0.97 |   0.63 |   1.00 |   0.66 |   1.00 |   0.82 |
3          |   0.66 |   0.45 |   0.57 |   0.71 |   0.84 |   0.96 |   0.67 |   0.32 |   0.65 |   0.82 |   0.73 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   0.68 |   0.55 |   0.45 |   0.70 |   0.92 |   0.93 |   0.69 |   0.74 |   0.65 |   0.91 |   0.78 |


TABLE 1.14
MEAN COLDS BY POLLUTION INDEX BY SEX BY RACE FOR 1974 DATA

           |               MALE                |              FEMALE               |  MALE  | FEMALE |        |
POLLUTION  | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL  |
INDEX      |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
1          |   0.77 |   0.61 |   0.46 |   0.38 |   0.96 |   0.82 |   0.63 |   0.61 |   0.73 |   0.93 |   0.83 |
2          |   0.71 |   0.64 |   0.44 |   0.70 |   0.94 |   1.06 |   0.66 |   0.62 |   0.67 |   0.97 |   0.82 |
3          |   0.70 |   0.74 |   0.38 |   0.84 |   0.91 |   0.78 |   0.81 |   0.25 |   0.68 |   0.88 |   0.78 |
-----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
TOTAL      |   0.74 |   0.63 |   0.43 |   0.66 |   0.94 |   0.96 |   0.70 |   0.49 |   0.70 |   0.93 |   0.81 |


          |                                  SEX                                  |                 |
          |               MALE                |              FEMALE               |       SEX       |
          | WHITE  | BLACK  |HISPANIC| OTHER  | WHITE  | BLACK  |HISPANIC| OTHER  | TOTAL  | TOTAL  | TOTAL
          | COLDS  | COLDS  | COLDS  | COLDS  | COLDS  | COLDS  | COLDS  | COLDS  | COLDS  | COLDS  | COLDS
POLLUTION |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN  |  MEAN
INDEX     |        |        |        |        |        |        |        |        |        |        |
----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------
1         |  0.75  |  0.58  |  0.45  |  0.93  |  0.96  |  0.78  |  0.63  |  0.69  |  0.72  |  0.92  |  0.82
2         |  0.71  |  0.48  |  0.52  |  0.71  |  1.00  |  0.83  |  0.78  |  0.74  |  0.62  |  0.92  |  0.77
3         |  0.67  |  0.41  |  0.52  |  0.75  |  0.90  |  0.88  |  0.64  |  0.83  |  0.66  |  0.88  |  0.76
TOTAL     |  0.72  |  0.51  |  0.49  |  0.79  |  0.95  |  0.82  |  0.67  |  0.75  |  0.67  |  0.91  |  0.79


TABLE 1.16
PROPORTION OF COLDS BY POLLUTION BY SEX FOR COMPLETE DATA

          |        | FALL | WINT | SPR  | FALL | WINT | SPR  | FALL | WINT | SPR
POLLUTION |        |  72  |  73  |  73  |  73  |  74  |  74  |  74  |  75  |  75
INDEX     | SEX    | PROP | PROP | PROP | PROP | PROP | PROP | PROP | PROP | PROP
----------+--------+------+------+------+------+------+------+------+------+------
1         | MALE   | 0.20 | 0.23 | 0.19 | 0.23 | 0.31 | 0.23 | 0.25 | 0.30 | 0.20
          | FEMALE | 0.28 | 0.35 | 0.27 | 0.27 | 0.41 | 0.30 | 0.31 | 0.36 | 0.28
----------+--------+------+------+------+------+------+------+------+------+------
2         | MALE   | 0.19 | 0.23 | 0.16 | 0.20 | 0.26 | 0.21 | 0.20 | 0.22 | 0.19
          | FEMALE | 0.31 | 0.37 | 0.29 | 0.33 | 0.36 | 0.31 | 0.32 | 0.37 | 0.28
----------+--------+------+------+------+------+------+------+------+------+------
3         | MALE   | 0.25 | 0.22 | 0.19 | 0.18 | 0.31 | 0.21 | 0.22 | 0.25 | 0.22
          | FEMALE | 0.28 | 0.31 | 0.25 | 0.25 | 0.39 | 0.29 | 0.29 | 0.34 | 0.28
----------+--------+------+------+------+------+------+------+------+------+------
TOTAL     |        | 0.25 | 0.29 | 0.23 | 0.25 | 0.35 | 0.26 | 0.27 | 0.31 | 0.24


CHAPTER II

RANDOMIZATION TESTS METHODOLOGY

2.1 Introduction

The analysis of the CHESS dataset will illustrate a

two stage approach in categorical data methodology. The

first stage involves the application of randomization test

strategies to assess the extent of association in the data

between the response variables (e.g., number of colds) and

evaluation variables (e.g., pollution index, area),

controlling for the effect of potential confounding

variables (e.g., sex, age). Most of these analyses will

involve the use of Mantel-Haenszel type statistics as

reviewed in Landis et al. (1978). Chapter II will be

concerned with a discussion of such randomization

techniques. Sections 2.3.2-2.3.3 discuss the randomization

statistic and its use in evaluating average partial

association. The mean score test and correlation test

extensions which can be utilized if the response or both

response and evaluation variables are ordinally scaled are

outlined in sections 2.3.4-2.3.5. The final section in this

chapter involves a further extension of randomization

statistics when there are multiple response variables. The

use of randomization statistics in variable selection

processes is described and illustrated in a later chapter.


The other phase of analysis is that of modeling the

variation among a set of estimates produced from the data

through the use of weighted least squares techniques. The

methodology of Grizzle, Starmer, and Koch (1969) for the linear

model analysis of categorical data is reviewed in Chapter

III. Examples of functions appropriate for modeling, such as

mean scores, are included, as well as a section on useful

analysis strategies for modeling the variation among

estimates derived from repeated measurements designs.

Finally, methods for the analysis of incomplete data are

given, including ratio estimation and supplemental margins.

First, however, the sections on data analysis methodology are

preceded by a discussion of the implications of the

research design under which data are obtained.

2.2 Research Design Implications

The nature of the statistical methodology applied to a

particular dataset is inherently linked with the research

design (or lack thereof) which gave rise to the data. The

analytical methods used as well as the interpretation of

their results and their generalizability to some extended

population are a function of the research sampling process

employed. Research data usually fall into one of the three

following design schemes:

1. Observational (historical) data from subjects in a study population having a natural (i.e., geographical, temporal) definition.

2. Data from an experimental design situation in which subjects are randomly allocated to different treatments.

3. Sample survey data where subjects are randomly selected from a larger study population.

Observational studies can include total population studies,

retrospective studies, and prospective in retrospect

studies. The CHESS dataset would be classified as a

prospective observational dataset.

The following discussion is in the spirit of Koch,

Gillings and Stokes (1980), and Koch and Gillings (1983).

The type of research design determines if and to what extent

two types of statistical strategies are applicable to the

research data. One is directed at the interpretation of

what is observed for the local population which the subjects

directly represent. The other is an extended population

analysis which generalizes to a larger target population

when certain assumptions about the sampling process can be

made. The local population coverage for a particular

research design depends on the degree to which randomization

was involved. A clinical trial experimental situation

involving fifty randomly allocated subjects may enable one

to make inferences to a local population which consists of

all the patients who had a known probability of receiving

either of two treatments. A sample survey data analysis

usually applies to a large-scale local population due to the

nature of the sampling design employed. Observational data,

with its inherent lack of randomization, may only qualify

for a local population analysis with results relevant to the

observed population only; the subjects studied are only

representative of themselves.


In general, one would like to make inferences to a

more extensive population than the local population.

However, various sorts of assumptions may be required to

establish the basis of generalization. One requirement is

that sample coverage is complete, i.e., all relevant

subpopulations are included in the sample. Another is that

any outcome differences within subpopulations be equivalent

to random variation. This has been termed the homogeneous

stratification assumption, and in stricter terms assumes

that the data are equivalent to a stratified random sample

for the target population in which the strata include

subjects partitioned according to an appropriate set of

explanatory variables. If the stratified population

structure is also assumable, then the framework is

equivalent to probability distributional sampling and the

appropriate statistical methods can be applied.

2.3 Randomization Test Methods

2.3.1 First Order Association

The first step in the analysis in this dissertation

involves an assessment of the association between the

outcome variables and the evaluation variable, and the

association between the outcome variables and the

demographic variables. The observational nature of the data

makes it reasonable to address this question in terms of the

following randomization hypothesis of no association:

Ho: There is no relationship between the subpopulations and the response variable in the sense that the observed partition of the response values into the subpopulations can be regarded as equivalent to a successive set of simple random samples.

The categorical data can be described in terms of the

general contingency table illustrated in Table 2.3.1.

Table 2.3.1
Observed Contingency Table

                     Response Variable Categories
Subpopulation      1       2      ...      r       Total

      1           y11     y12     ...     y1r      y1*
      2           y21     y22     ...     y2r      y2*
     ...          ...     ...     ...     ...      ...
      s           ys1     ys2     ...     ysr      ys*

    Total         y*1     y*2     ...     y*r      y**

Here, $y_{ij}$ denotes the number of subjects in the i-th subpopulation who have the j-th response, with i=1,2,...,s and j=1,2,...,r; $y_{i*}$ denotes the marginal total number of subjects classified as being in the i-th subpopulation, $y_{*j}$ denotes the marginal total number of subjects classified as belonging to the j-th response category, and $y_{**}$ denotes the total sample size. Under the condition that both sets of marginal totals are fixed, the randomization hypothesis can be re-stated:

Ho: The response variable is distributed at random with respect to the subpopulations, i.e., the data in the respective rows of the table can be regarded as a successive set of simple random samples of sizes $\{y_{i*}\}$ from a fixed population corresponding to the marginal total distribution of the response variable $\{y_{*j}\}$.

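Stratified sampling arguments can be used to show that, under Ho, the vector $\mathbf{y} = (y_{11}, y_{12}, \ldots, y_{sr})'$ of cell counts follows the product hypergeometric distribution given by:

\[
\Pr\{\mathbf{y} \mid H_0\} = \frac{\prod_{i=1}^{s} y_{i*}! \; \prod_{j=1}^{r} y_{*j}!}{y_{**}! \; \prod_{i=1}^{s} \prod_{j=1}^{r} y_{ij}!}
\tag{2.3.1}
\]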

For observational data, these margins are fixed due to the

inherent nature of the hypothesis itself, and the

hypergeometric distribution assumption follows. With

experimental data, the row margins would be fixed by the

treatment allocation process, and the column margins fixed

by the hypothesis. Thus, the $y_{ij}$ do not follow the

hypergeometric distribution when the null hypothesis is

rejected, and the implied hypergeometric model cannot be

used as the basis for additional analysis. When the

hypergeometric model holds,

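\[
E\{y_{ij} \mid H_0\} = \frac{y_{i*}\, y_{*j}}{y_{**}} = m_{ij}
\tag{2.3.2}
\]

\[
\mathrm{Cov}\{y_{ij}, y_{i'j'} \mid H_0\} = v_{ij,i'j'}
= \frac{y_{i*}\, y_{*j}\, (\delta_{ii'}\, y_{**} - y_{i'*})(\delta_{jj'}\, y_{**} - y_{*j'})}{y_{**}^{2}\, (y_{**} - 1)}
\tag{2.3.3}
\]

where $\delta_{kk'} = 1$ if $k = k'$ and $\delta_{kk'} = 0$ if $k \neq k'$.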

When the sample size is large enough (e.g., $m_{ij} \geq 5$), the vector $\mathbf{y}$ has an approximate multivariate normal distribution by randomization central limit theory (see Puri and Sen (1971)), with covariance matrix $\mathbf{V}$ whose elements are the $\{v_{ij,i'j'}\}$. With $\mathbf{m}$ denoting the vector of expected counts $\{m_{ij}\}$, the quadratic form

\[
Q = (\mathbf{y} - \mathbf{m})' \mathbf{A}' [\mathbf{A} \mathbf{V} \mathbf{A}']^{-1} \mathbf{A} (\mathbf{y} - \mathbf{m})
\]

thus represents a randomization test statistic which is approximately chi-square with degrees of freedom = Rank{A} when A is specified so that AVA' is non-singular. The particular matrix

\[
\mathbf{A} = [\mathbf{I}_{(r-1)}, \mathbf{0}_{(r-1)}] \otimes [\mathbf{I}_{(s-1)}, \mathbf{0}_{(s-1)}]
\]

generates (s-1)(r-1) linear functions from (y-m) which are the differences between the observed and expected counts for the first (r-1) cells of the first (s-1) subpopulations. In this case, Q would have (s-1)(r-1) degrees of freedom. The statistic Q can be shown to have the form

\[
Q = \frac{y_{**} - 1}{y_{**}} \sum_{i=1}^{s} \sum_{j=1}^{r} \frac{(y_{ij} - m_{ij})^{2}}{m_{ij}} = \frac{y_{**} - 1}{y_{**}}\, Q_P
\tag{2.3.4}
\]

where $Q_P$ is the familiar Pearson chi-square statistic. Thus, for large samples, $Q$ and $Q_P$ are asymptotically equivalent. When the sample size is very small, the significance of $Q$ can be evaluated with exact methods which yield Fisher's Exact Test for 2 x 2 tables.
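The relation between the randomization statistic and the Pearson statistic in (2.3.4) can be checked numerically. The following is a minimal Python sketch, assuming NumPy is available; the function name `randomization_q` and the example counts are hypothetical and are not taken from the CHESS data. Cells are stored in row-major order (subpopulation index varying slowest), so the Kronecker factors are passed to `numpy.kron` in the order that this storage requires, which is the reverse of the order in which they are written in the text.

```python
import numpy as np

def randomization_q(table):
    """Return (Q, Q_P, df) for an s x r table of counts.

    Q is the randomization quadratic form built from the hypergeometric
    covariance structure; Q_P is the Pearson chi-square statistic.
    """
    y = np.asarray(table, dtype=float)
    s, r = y.shape
    n = y.sum()
    row = y.sum(axis=1)                 # subpopulation totals y_i*
    col = y.sum(axis=0)                 # response totals y_*j
    m = np.outer(row, col) / n          # expected counts m_ij under H0

    # Covariance of the cell counts under H0 (row-major cell order):
    # v = y_i* y_*j (d_ii' n - y_i'*)(d_jj' n - y_*j') / (n^2 (n - 1))
    row_part = n * np.diag(row) - np.outer(row, row)
    col_part = n * np.diag(col) - np.outer(col, col)
    v = np.kron(row_part, col_part) / (n**2 * (n - 1))

    # A keeps the first (r-1) cells of the first (s-1) subpopulations.
    a = np.kron(np.eye(s - 1, s), np.eye(r - 1, r))

    d = a @ (y - m).ravel()
    q = float(d @ np.linalg.solve(a @ v @ a.T, d))
    q_p = float(((y - m) ** 2 / m).sum())
    return q, q_p, (s - 1) * (r - 1)

# Hypothetical 2 x 3 table (illustrative counts only).
q, q_p, df = randomization_q([[10, 20, 30], [20, 20, 10]])
n = 110
print(q, q_p * (n - 1) / n, df)   # the first two values should agree
```

Because the two statistics differ only by the factor (y** - 1)/y**, the choice between them matters mainly for small tables.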

2.3.2 Partial Association

The meaningfulness of a significant first order

association is limited, especially in an observational data

setting. Apparent associations may be due to the effects of

other variables which have not been taken into account in

the analysis. In addition, controlling for the effects of

such variables often leads to the uncovering of an effect

which was not evident in a first order association sense.

In epidemiological literature, this phenomenon is known as

confounding, defined by Breslow and Day (1980) as

"... the distortion of a disease/exposure association brought about by the association of other factors with both disease and exposure, the latter association with the disease being causal."

Disease status can be considered the response variable,

while exposure variables can be considered evaluation

variables. Thus, confounding variables must be taken into

account when one is attempting to assess the relationship

between a response variable and evaluation variable. One of

the most frequently used methods to control for these

confounders is to partition the data into a set of strata

which are internally homogeneous with respect to the

confounding variables and then calculate a measure that

summarizes the associations within the strata as a whole.

Randomization model methods are an appropriate strategy for

accomplishing this within the context of hypotheses

corresponding to randomized allocations for the distribution


of the response variable across subpopulations for the

evaluation variable.

With categorical data, the confounder variable being

controlled will consist of a number of categories which

correspond to covariate levels such as hospital,

geographical area, or age group. These will be the basis of

stratification. Thus, the data can be expressed in terms of

q: s x r contingency tables where q is the number of

covariate levels. The hypothesis of interest can then be

stated in terms of no partial association between the

response variable and the evaluation variable after

adjusting for the possible effects due to the confounders.

Cochran proposed a test statistic for the hypothesis of no

partial association in q: 2 x 2 tables in 1954 (Cochran

1954). His statistic was based on a mean difference

weighted across the q tables. The evaluation of this

chi-square statistic's significance required asymptotic

assumptions for a binomial model which requires moderately

large sample sizes for each of the strata (≥ 20). In 1971,

Hopkins and Gross proposed a generalization of this

procedure to q: s x r tables (Hopkins and Gross 1971).

Mantel and Haenszel (1959) noted that the same problem

could be addressed within the context of a hypergeometric

framework which required only that the total sample size

(across tables) be large enough for asymptotic methods to

apply. Their method utilized expected values and variances for a pivot cell in each of the 2 x 2 tables, but is invariant to which of the cells is selected. When sample sizes are appropriately large, the Cochran and Mantel-Haenszel methods lead to equivalent results, as the statistics differ by the factor $(y_{h**}-1)/y_{h**}$ where $y_{h**}$ is the sample size of the h-th stratum. In the same paper, Mantel and Haenszel also outlined a procedure which would examine the hypothesis of no partial association in a set of q: 2 x r tables. The method consists of calculating the expected values and covariances of (r-1) pivot cells for each of the tables using the hypergeometric model as before. These quantities are summed and the hypothesis tested via a quadratic form test statistic. However, the details were only outlined in the (2 x 3) contingency table setting.

In 1963, Mantel discussed the situation involving

q: s x r tables where the response variable is ordinally

scaled with progressively larger intensities. He discussed

scoring systems and developed a test statistic which was a

function of the subpopulation mean scores. He suggested a

correlation type statistic if the evaluation variable was

also ordinally scaled. The test statistic based on (s-1)(r-1) pivotal cells has been outlined and used in data analysis, as illustrated by Koch and Reinfurt (1974) with highway safety data. Landis, Heyman, and Koch (1978) have discussed the methods with which one can evaluate average partial association in contingency tables and also tie together the Cochran-Mantel-Haenszel approach to the analysis of q: s x r tables with a general notation and matrix formulation. Koch, Imrey et al. (1985) also present randomization statistics which apply in the analysis of categorical data in a comprehensive manner. The next sections detail the existing randomization methodology.

sections detail the existing randomization methodology.

2.3.3 Average Partial Association Methodology

Let h=1,2, ... q index a set of q: s x r contingency

tables where h corresponds to a distinct level of a

stratification variable or combination of such variables.

The general table outlined in Table 2.3.1 can represent the

h-th such table, where s subpopulations are to be compared

with respect to a response variable which has r outcome

categories. All of the table entries should be considered

to have an 'h' appended to the beginning of the subscript. Thus, let $y_{hij}$ denote the number of subjects who are classified as belonging to the h-th stratum, i-th subpopulation and j-th response category. Let $y_{hi*}$ denote the marginal total number of subjects who are classified as belonging to the i-th subpopulation for the h-th stratum and $y_{h*j}$ denote the marginal total number of subjects who are classified into the j-th response level.

The hypothesis being investigated is the randomization

hypothesis of no association and is stated as follows:

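Ho: For each stratum h=1,2,...,q, the response variable is distributed at random across the subpopulations; i.e., the data $\{y_{hij}\}$ for the respective subpopulations arose as successive sets of stratified simple random samples of sizes $\{y_{hi*}\}$ from the finite population distributions $\{y_{h*j}\}$ for the q strata.

By the above hypothesis, the $\{y_{hij}\}$ have the product multiple hypergeometric distribution

\[
\Pr\{\mathbf{y} \mid H_0\} = \prod_{h=1}^{q} \frac{\prod_{i=1}^{s} y_{hi*}! \; \prod_{j=1}^{r} y_{h*j}!}{y_{h**}! \; \prod_{i=1}^{s} \prod_{j=1}^{r} y_{hij}!}
\tag{2.3.5}
\]

where $\mathbf{y} = (\mathbf{y}_1', \mathbf{y}_2', \ldots, \mathbf{y}_q')'$ and $\mathbf{y}_h$ is the vector of cell counts for the h-th table. As reviewed in section 2.3.1, the $\mathbf{y}_h$ have independent approximate multivariate normal distributions. The mean vector and corresponding covariance matrix are expressible as

\[
E\{\mathbf{y}_h \mid H_0\} = \mathbf{m}_h = y_{h**}\, (\mathbf{p}_{h\cdot *} \otimes \mathbf{p}_{h*\cdot})
\]

\[
\mathrm{Var}\{\mathbf{y}_h \mid H_0\} = \mathbf{V}_h = \frac{y_{h**}^{2}}{y_{h**} - 1}
\left[ \mathbf{D}_{\mathbf{p}_{h\cdot *}} - \mathbf{p}_{h\cdot *}\mathbf{p}_{h\cdot *}' \right] \otimes
\left[ \mathbf{D}_{\mathbf{p}_{h*\cdot}} - \mathbf{p}_{h*\cdot}\mathbf{p}_{h*\cdot}' \right]
\tag{2.3.6}
\]

where $\mathbf{p}_{h*\cdot} = (y_{h1*}, y_{h2*}, \ldots, y_{hs*})'/y_{h**}$ and $\mathbf{p}_{h\cdot *} = (y_{h*1}, y_{h*2}, \ldots, y_{h*r})'/y_{h**}$, $\mathbf{D}_{\mathbf{A}}$ represents a diagonal matrix with elements of the vector $\mathbf{A}$ on the main diagonal, and $\otimes$ denotes Kronecker product multiplication.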

Consider the situation where both the response

variable and evaluation variable are measured on a nominal

scale. Ho can be tested in terms of the sum of the

individual approximate chi-square statistics Q for each of

the q strata, i.e.

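\[
Q_{TR} = \sum_{h=1}^{q} (\mathbf{y}_h - \mathbf{m}_h)' \mathbf{A}' (\mathbf{A} \mathbf{V}_h \mathbf{A}')^{-1} \mathbf{A} (\mathbf{y}_h - \mathbf{m}_h)
\tag{2.3.7}
\]

where $\mathbf{A} = [\mathbf{I}_{(r-1)}, \mathbf{0}_{(r-1)}] \otimes [\mathbf{I}_{(s-1)}, \mathbf{0}_{(s-1)}]$ as before. $Q_{TR}$ is

approximately chi-square with d.f.=q(r-1)(s-1) and is termed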

the total partial association statistic. There must be

sufficient individual stratum sample sizes in order that QTR

be appropriate.

Ho can also be investigated in terms of the sums of

corresponding differences between the observed and expected

frequencies across the q strata in an extension of the

Mantel-Haenszel methodology. Let

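\[
\mathbf{G} = \sum_{h=1}^{q} \mathbf{A} (\mathbf{y}_h - \mathbf{m}_h); \quad \text{hence} \quad
\mathrm{Var}\{\mathbf{G} \mid H_0\} = \sum_{h=1}^{q} \mathbf{A} \mathbf{V}_h \mathbf{A}'
\]

Thus, an appropriate test statistic is written

\[
Q_{AR} = \left\{ \sum_{h=1}^{q} (\mathbf{y}_h - \mathbf{m}_h)' \mathbf{A}' \right\}
\left\{ \sum_{h=1}^{q} \mathbf{A} \mathbf{V}_h \mathbf{A}' \right\}^{-1}
\left\{ \sum_{h=1}^{q} \mathbf{A} (\mathbf{y}_h - \mathbf{m}_h) \right\}
\tag{2.3.8}
\]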

and is called the average partial association statistic. It

is approximately chi-square with d.f.=Rank{A} and is

effective in assessing the extent to which Ho is

contradicted by a consistent pattern of association of

response and evaluation variables across the strata.

An advantage of $Q_{AR}$ over $Q_{TR}$ is its less stringent sample size requirements. Its asymptotic distribution depends on the overall sample size $y_{***} = \sum_h y_{h**}$, not on each of the individual $y_{h**}$. Also, since the significance of $Q_{AR}$ is evaluated with respect to fewer degrees of freedom, there is a potential gain in power. It should be noted that $Q_{AR}$ has a narrow alternative to Ho in comparison with $Q_{TR}$. While the null hypothesis of no difference in evaluation groups applies to both statistics in this situation, the alternative hypothesis for $Q_{AR}$ can be stated as there being a consistent difference across the strata. Thus, $Q_{AR}$ may fail to detect a difference when differences in opposite directions in different strata offset one another. This has been called the broad hypothesis, narrow alternative property.
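The broad hypothesis, narrow alternative property can be illustrated with a small numerical sketch. The Python code below, which assumes NumPy, computes the total and average partial association statistics of (2.3.7) and (2.3.8) for a list of strata; the function names `stratum_terms` and `partial_association` and the counts are hypothetical. Cells are stored in row-major order, so the Kronecker factors are arranged for `numpy.kron` rather than in the order written in the text.

```python
import numpy as np

def stratum_terms(table):
    """Return (A(y - m), A V A') for one s x r stratum under H0."""
    y = np.asarray(table, dtype=float)
    s, r = y.shape
    n = y.sum()
    row, col = y.sum(axis=1), y.sum(axis=0)
    m = np.outer(row, col) / n                      # expected counts
    row_part = n * np.diag(row) - np.outer(row, row)
    col_part = n * np.diag(col) - np.outer(col, col)
    v = np.kron(row_part, col_part) / (n**2 * (n - 1))
    a = np.kron(np.eye(s - 1, s), np.eye(r - 1, r))  # pivot-cell selector
    return a @ (y - m).ravel(), a @ v @ a.T

def partial_association(tables):
    """Return (Q_TR, Q_AR) for a list of s x r strata."""
    terms = [stratum_terms(t) for t in tables]
    # Q_TR: sum of the separate stratum quadratic forms.
    q_tr = sum(float(d @ np.linalg.solve(w, d)) for d, w in terms)
    # Q_AR: quadratic form in the summed differences.
    g = sum(d for d, _ in terms)
    var_g = sum(w for _, w in terms)
    q_ar = float(g @ np.linalg.solve(var_g, g))
    return q_tr, q_ar

# Two hypothetical strata with equal and opposite associations.
opposed = [[[20, 10], [10, 20]], [[10, 20], [20, 10]]]
# Two hypothetical strata with the same association.
consistent = [[[20, 10], [10, 20]], [[20, 10], [10, 20]]]

print(partial_association(opposed))
print(partial_association(consistent))
```

For the opposed strata the summed differences cancel, so the average statistic is zero even though the total statistic is large; for the consistent strata the two statistics coincide, but the average statistic is referred to fewer degrees of freedom.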

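When s=r=2 in the case of a set of q: 2 x 2 tables, the average partial association (Mantel-Haenszel) statistic becomes

\[
Q_{AR} = \frac{(y_{*11} - m_{*11})^{2}}{\sum_{h=1}^{q} v_{h11}}
= \frac{\left\{ \sum_{h=1}^{q} \dfrac{y_{h1*}\, y_{h2*}}{y_{h**}} (p_{h11} - p_{h21}) \right\}^{2}}
{\sum_{h=1}^{q} \dfrac{y_{h1*}\, y_{h2*}\, y_{h*1}\, y_{h*2}}{y_{h**}^{2}\, (y_{h**} - 1)}}
\tag{2.3.9}
\]

where $y_{*11} = \sum_h y_{h11}$, $m_{*11} = \sum_h m_{h11}$, and $p_{hi1} = y_{hi1}/y_{hi*}$. $Q_{AR}$ is the same as the 1959 Mantel-Haenszel statistic, and is approximately chi-square with 1 d.f.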
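The two expressions for the Mantel-Haenszel statistic in (2.3.9) can be compared directly. The following Python sketch is illustrative only: the function name `mantel_haenszel_q` and the example counts are hypothetical, and no continuity correction is applied.

```python
def mantel_haenszel_q(tables):
    """Return Q_AR computed two ways for 2 x 2 tables [[y11, y12], [y21, y22]].

    The first value uses the summed pivot-cell differences; the second uses
    the weighted differences in proportions. Both share the same denominator.
    """
    diff_sum = 0.0       # sum over strata of (y_h11 - m_h11)
    weighted_sum = 0.0   # sum of (y_h1* y_h2* / y_h**)(p_h11 - p_h21)
    var_sum = 0.0        # sum over strata of v_h11
    for (y11, y12), (y21, y22) in tables:
        n1, n2 = y11 + y12, y21 + y22      # subpopulation totals
        c1, c2 = y11 + y21, y12 + y22      # response totals
        n = n1 + n2
        diff_sum += y11 - n1 * c1 / n
        weighted_sum += (n1 * n2 / n) * (y11 / n1 - y21 / n2)
        var_sum += n1 * n2 * c1 * c2 / (n**2 * (n - 1))
    return diff_sum**2 / var_sum, weighted_sum**2 / var_sum

# Hypothetical strata (illustrative counts only).
strata = [[[12, 18], [8, 22]], [[15, 10], [9, 16]]]
form_1, form_2 = mantel_haenszel_q(strata)
print(form_1, form_2)   # the two forms should agree
```

The weighted-difference form makes explicit that each stratum contributes in proportion to $y_{h1*}y_{h2*}/y_{h**}$, so strata with balanced and large subpopulations carry the most weight.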

2.3.4 Mean Score Test

When the response variable is ordinally scaled with

progressively larger intensities, Ho can be tested in terms

of the variation of stratum mean scores among the

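subpopulations. Let $\mathbf{a}_h' = (a_{h1}, a_{h2}, \ldots, a_{hr})$ represent a set of scores for the h-th table which characterize the relative status of the response variable levels in some way. Let

\[
\bar{y}_{hi} = \sum_{j=1}^{r} a_{hj}\, y_{hij} / y_{hi*}
\tag{2.3.10}
\]

be the mean score for the i-th subpopulation in the h-th stratum. The vector of mean scores $\bar{\mathbf{y}}_{h*} = (\bar{y}_{h1}, \bar{y}_{h2}, \ldots, \bar{y}_{hs})'$ can be expressed as

\[
\bar{\mathbf{y}}_{h*} = \frac{1}{y_{h**}}\, \mathbf{D}_{\mathbf{p}_{h*\cdot}}^{-1} \left[ \mathbf{a}_h' \otimes \mathbf{I}_s \right] \mathbf{y}_h
\tag{2.3.11}
\]

By (2.3.6),

\[
E\{\bar{\mathbf{y}}_{h*} \mid H_0\} = \mu_{a,h}\, \mathbf{1}_s, \qquad
\mathrm{Var}\{\bar{\mathbf{y}}_{h*} \mid H_0\} = \frac{v_{a,h}}{y_{h**} - 1} \left[ \mathbf{D}_{\mathbf{p}_{h*\cdot}}^{-1} - \mathbf{1}_s \mathbf{1}_s' \right]
\tag{2.3.12}
\]

where

\[
\mu_{a,h} = \sum_{j=1}^{r} a_{hj}\, (y_{h*j} / y_{h**})
\]

is the finite population mean score and

\[
v_{a,h} = \sum_{j=1}^{r} (a_{hj} - \mu_{a,h})^{2}\, (y_{h*j} / y_{h**})
\tag{2.3.14}
\]

which is the finite population variance for all subjects in the h-th stratum.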

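Let $\mathbf{C} = [\mathbf{I}_{(s-1)}, -\mathbf{1}_{(s-1)}]$ denote a matrix which will compare the first (s-1) subpopulations with the s-th. Then

\[
E\{\mathbf{C}\bar{\mathbf{y}}_{h*} \mid H_0\} = \mathbf{0}_{(s-1)}, \qquad
\mathrm{Var}\{\mathbf{C}\bar{\mathbf{y}}_{h*} \mid H_0\} = \frac{v_{a,h}}{y_{h**} - 1} \left[ \mathbf{C} \mathbf{D}_{\mathbf{p}_{h*\cdot}}^{-1} \mathbf{C}' \right]
\tag{2.3.15}
\]

When sample sizes for the subpopulations within the strata are large enough (e.g., $y_{hi*} \geq 30$), the $\bar{\mathbf{y}}_{h*}$ will have independent approximate multivariate normal distributions via randomization central limit theory. Thus, the statistics

\[
Q_{RS,h} = \frac{y_{h**} - 1}{v_{a,h}}\; \bar{\mathbf{y}}_{h*}' \mathbf{C}' \left[ \mathbf{C} \mathbf{D}_{\mathbf{p}_{h*\cdot}}^{-1} \mathbf{C}' \right]^{-1} \mathbf{C} \bar{\mathbf{y}}_{h*}
\tag{2.3.16}
\]

have approximate chi-square distributions with d.f.=(s-1), and

\[
Q_{TRS} = \sum_{h=1}^{q} Q_{RS,h}
\tag{2.3.17}
\]

has an approximate chi-square distribution with d.f.=q(s-1). It should also be noted that $Q_{RS,h}$ can be verified to have the analysis of variance form

\[
Q_{RS,h} = \frac{(y_{h**} - 1) \sum_{i=1}^{s} y_{hi*}\, (\bar{y}_{hi} - \mu_{a,h})^{2}}{y_{h**}\, v_{a,h}}
\tag{2.3.18}
\]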

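The Mantel-Haenszel methodology can be extended to encompass this mean score situation as well, by focusing on the sum of the differences between the subpopulation mean scores and their expected values under Ho. Let

\[
\mathbf{G} = \sum_{h=1}^{q} y_{h**}\, \mathbf{C} \mathbf{D}_{\mathbf{p}_{h*\cdot}} (\bar{\mathbf{y}}_{h*} - \mu_{a,h} \mathbf{1}_s) = \sum_{h=1}^{q} \mathbf{A}_h (\mathbf{y}_h - \mathbf{m}_h)
\]

be such a sum, where $\mathbf{A}_h = \mathbf{C}[\mathbf{a}_h' \otimes \mathbf{I}_s]$ and $\mathbf{C} = [\mathbf{I}_{(s-1)}, -\mathbf{1}_{(s-1)}]$ as defined above. Then

\[
E\{\mathbf{G} \mid H_0\} = \mathbf{0}, \qquad
\mathrm{Var}\{\mathbf{G} \mid H_0\} = \sum_{h=1}^{q} \mathbf{A}_h \mathbf{V}_h \mathbf{A}_h'
\tag{2.3.19}
\]

Thus, the randomization mean score statistic to detect average partial association can be written

\[
Q_{ARS} = \left\{ \sum_{h=1}^{q} (\mathbf{y}_h - \mathbf{m}_h)' \mathbf{A}_h' \right\}
\left\{ \sum_{h=1}^{q} \mathbf{A}_h \mathbf{V}_h \mathbf{A}_h' \right\}^{-1}
\left\{ \sum_{h=1}^{q} \mathbf{A}_h (\mathbf{y}_h - \mathbf{m}_h) \right\}
\tag{2.3.20}
\]

where $\mathbf{A}_h$ is defined as above. $Q_{ARS}$ is approximately chi-square with d.f.=(s-1), and is directed at location shift alternatives in the sense that it assesses the extent to which mean scores in one subpopulation exceed (or are exceeded by) the mean scores in the others. $Q_{ARS}$ is often termed an analysis of variance test statistic, and as in the case of $Q_{AR}$, its asymptotic properties are linked to the total combined strata sample sizes rather than to the individual strata sizes.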
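The mean score statistic for average partial association can be sketched numerically as follows. This Python code, which assumes NumPy, accumulates the contrasts of score totals and their randomization covariance across strata; for a single stratum the result should reproduce the analysis of variance form of the stratum-specific statistic. The function names `mean_score_q` and `anova_form`, the counts, and the integer scores are hypothetical choices made for illustration.

```python
import numpy as np

def mean_score_q(tables, scores):
    """Average partial association mean score statistic over strata."""
    g = 0.0
    var_g = 0.0
    for table, a in zip(tables, scores):
        y = np.asarray(table, dtype=float)      # s x r counts for one stratum
        a = np.asarray(a, dtype=float)          # r response scores
        s = y.shape[0]
        n = y.sum()
        row, col = y.sum(axis=1), y.sum(axis=0)
        mu = a @ col / n                        # finite population mean score
        v_a = ((a - mu) ** 2) @ col / n         # finite population variance
        p = row / n
        # C compares the first (s-1) subpopulations with the s-th.
        c = np.hstack([np.eye(s - 1), -np.ones((s - 1, 1))])
        # Observed minus expected score totals, then contrasted by C.
        g = g + c @ (y @ a - row * mu)
        var_g = var_g + (n**2 * v_a / (n - 1)) * (
            c @ (np.diag(p) - np.outer(p, p)) @ c.T)
    return float(g @ np.linalg.solve(var_g, g))

def anova_form(table, a):
    """Single-stratum mean score statistic in analysis of variance form."""
    y = np.asarray(table, dtype=float)
    a = np.asarray(a, dtype=float)
    n = y.sum()
    row, col = y.sum(axis=1), y.sum(axis=0)
    mu = a @ col / n
    v_a = ((a - mu) ** 2) @ col / n
    means = (y @ a) / row                       # subpopulation mean scores
    return float((n - 1) * (row @ (means - mu) ** 2) / (n * v_a))

# Hypothetical 3 x 4 stratum with integer scores 1..4.
table = [[10, 15, 20, 5], [8, 12, 25, 15], [5, 10, 20, 25]]
print(mean_score_q([table], [[1, 2, 3, 4]]), anova_form(table, [1, 2, 3, 4]))
```

With several strata, `mean_score_q` pools the numerator and the covariance before forming the quadratic form, which is why its large-sample behavior depends on the combined sample size.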

Typically, applied scores include integer scores, rank scores, ridit scores, binary scores and logrank scores. Integer scores are written $a_{hj} = j$, j=1,2,...,r, and are of interest when the response levels are considered equally spaced. Binary scores are those such as $a_{hj} = 1$ for j=1,2,...,k and $a_{hj} = 0$ for j=(k+1),...,r and are useful when the focus is on comparing a certain combination of the levels of the response variable with the others. They collapse the response levels into a dichotomous variable. Rank scores are defined as

\[
a_{hj} = \left\{ \sum_{k=1}^{j-1} y_{h*k} \right\} + \left\{ (y_{h*j} + 1)/2 \right\}
\tag{2.3.21}
\]

and are useful for ordinal data; midranks are used for ties. Ridit scores are rank scores which are divided by the strata sample sizes $y_{h**}$. When rank scores are applied, $Q_{ARS}$ is equivalent to a partial Kruskal-Wallis statistic.

Logrank scores are written as

\[
a_{hj} = 1 - \sum_{k=1}^{j} \frac{y_{h*k}}{\sum_{m=k}^{r} y_{h*m}}
\tag{2.3.22}
\]

and are of interest for L-shaped distributions and other distributions when differences in subpopulations are more apparent in the upper response categories than the lower.
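The scoring systems described above depend only on the response margins of a stratum, so they are simple to compute. The Python sketch below, assuming NumPy, returns integer, rank (midrank), ridit, and logrank scores from a vector of response totals; the function name `response_scores` and the example margins are hypothetical. The logrank expression used here is the commonly cited form, one minus a cumulative sum of conditional proportions, and should be read as an assumption rather than as a transcription of the source equation. All response totals are assumed to be positive.

```python
import numpy as np

def response_scores(col_totals):
    """Return a dict of score vectors computed from response totals y_h*j."""
    y = np.asarray(col_totals, dtype=float)
    n = y.sum()
    r = len(y)
    integer = np.arange(1, r + 1, dtype=float)   # a_j = j
    below = np.cumsum(y) - y                     # subjects in lower categories
    rank = below + (y + 1) / 2                   # midrank of category j
    ridit = rank / n                             # rank scaled by stratum size
    at_or_above = y[::-1].cumsum()[::-1]         # subjects in category k or higher
    logrank = 1 - np.cumsum(y / at_or_above)     # assumed logrank form
    return {"integer": integer, "rank": rank, "ridit": ridit, "logrank": logrank}

# Hypothetical response margins for one stratum.
scores = response_scores([2, 3, 5])
for name, values in scores.items():
    print(name, values)
```

A convenient check on the logrank scores is that they sum to zero over all subjects in the stratum, that is, the margin-weighted sum of the scores vanishes.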

2.3.5 Correlation Test

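An additional set of possible test statistics for Ho occurs when both the response and subpopulations are ordinally scored with progressively larger intensities. Specifically, one can focus on the product score functions

\[
F_h = \sum_{i=1}^{s} \sum_{j=1}^{r} c_{hi}\, a_{hj}\, (y_{hij} / y_{h**})
\tag{2.3.22}
\]

where the $\{c_{hi}\}$ characterize the relative status of the subpopulation variable levels much as the $\{a_{hj}\}$ do for the response variable for each of the q strata. (2.3.1) and (2.3.2) imply

\[
E\{F_h \mid H_0\} = \left[ \sum_{i=1}^{s} c_{hi}\, y_{hi*} / y_{h**} \right] \left[ \sum_{j=1}^{r} a_{hj}\, y_{h*j} / y_{h**} \right] = \mu_{c,h}\, \mu_{a,h}
\tag{2.3.23}
\]

\[
\mathrm{Var}\{F_h \mid H_0\} = \frac{\left[ \sum_{i=1}^{s} (c_{hi} - \mu_{c,h})^{2}\, (y_{hi*}/y_{h**}) \right] \left[ \sum_{j=1}^{r} (a_{hj} - \mu_{a,h})^{2}\, (y_{h*j}/y_{h**}) \right]}{y_{h**} - 1}
= \frac{v_{c,h}\, v_{a,h}}{y_{h**} - 1}
\tag{2.3.24}
\]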

Sufficient stratum sample sizes $\{y_{h**}\}$ (≥ 30) provide for randomization central limit theory to imply that the $\{F_h\}$ have independent normal distributions with (2.3.23) and (2.3.24) as their expected values and variances, respectively. A correlation test statistic can hence be calculated as

\[
Q_{RC,h} = \frac{(y_{h**} - 1)\, (F_h - \mu_{c,h}\, \mu_{a,h})^{2}}{v_{c,h}\, v_{a,h}}
= \frac{(y_{h**} - 1) \left[ \sum_{i=1}^{s} \sum_{j=1}^{r} (c_{hi} - \mu_{c,h})(a_{hj} - \mu_{a,h})\, (y_{hij}/y_{h**}) \right]^{2}}{v_{c,h}\, v_{a,h}}
= (y_{h**} - 1)\, R_{ac,h}^{2}
\tag{2.3.25}
\]

where $R_{ac,h}$ is the correlation coefficient of response scores and subpopulation scores. The $Q_{RC,h}$ have independent approximate chi-square distributions with d.f.=1. Thus, the total partial association statistic

\[
Q_{TRC} = \sum_{h=1}^{q} Q_{RC,h}
\tag{2.3.26}
\]

is approximately chi-square with d.f.=q.

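The general form

\[
Q = \left\{ \sum_{h=1}^{q} (\mathbf{y}_h - \mathbf{m}_h)' \mathbf{A}_h' \right\}
\left\{ \sum_{h=1}^{q} \mathbf{A}_h \mathbf{V}_h \mathbf{A}_h' \right\}^{-1}
\left\{ \sum_{h=1}^{q} \mathbf{A}_h (\mathbf{y}_h - \mathbf{m}_h) \right\}
\]

is again appropriate to investigate Ho in terms of average partial association, this time in the context of a correlation-type statistic; $Q_{ARC} = Q$, where

\[
\mathbf{A}_h = [\mathbf{a}_h' \otimes \mathbf{c}_h']
\]

for $\mathbf{a}_h = (a_{h1}, a_{h2}, \ldots, a_{hr})'$ and $\mathbf{c}_h = (c_{h1}, c_{h2}, \ldots, c_{hs})'$, is asymptotically chi-square with d.f.=1. It is directed at the extent to which there is a consistent positive or negative association between the subpopulation scores and response scores across the strata, in an 'average sense.' If ridit or rank scores are applied, with midranks assigned for ties, then $Q_{ARC}$ is equivalent to a partial Spearman rank correlation test.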
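The correlation statistics can be sketched in the same way. The Python code below, assuming NumPy, computes the stratum-specific statistics and the pooled average statistic from score-weighted sums; the function name `correlation_q`, the counts, and the integer scores are hypothetical. For a single stratum, the statistic should equal (y** - 1) times the squared Pearson correlation of the scores taken over individual subjects.

```python
import numpy as np

def correlation_q(tables, row_scores, col_scores):
    """Return (list of stratum statistics, pooled average statistic)."""
    per_stratum = []
    g = 0.0        # sum over strata of y_h** (F_h - E{F_h})
    var_g = 0.0    # sum over strata of y_h**^2 Var{F_h}
    for table, c, a in zip(tables, row_scores, col_scores):
        y = np.asarray(table, dtype=float)
        c = np.asarray(c, dtype=float)           # subpopulation scores
        a = np.asarray(a, dtype=float)           # response scores
        n = y.sum()
        row, col = y.sum(axis=1), y.sum(axis=0)
        mu_c, mu_a = c @ row / n, a @ col / n
        v_c = ((c - mu_c) ** 2) @ row / n
        v_a = ((a - mu_a) ** 2) @ col / n
        f = c @ y @ a / n                        # product score function
        var_f = v_c * v_a / (n - 1)
        per_stratum.append(float((f - mu_c * mu_a) ** 2 / var_f))
        g += n * (f - mu_c * mu_a)
        var_g += n**2 * var_f
    return per_stratum, float(g**2 / var_g)

# Hypothetical 3 x 3 stratum with integer scores on both variables.
table = [[10, 6, 4], [6, 8, 6], [3, 7, 10]]
per_stratum, pooled = correlation_q([table], [[1, 2, 3]], [[1, 2, 3]])
print(per_stratum, pooled)
```

Replacing the integer scores with midranks computed within each stratum would give the rank-based version mentioned in the text.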

2.4 Multivariate Randomization Statistics

There may be interest in assessing whether more than

one response variable is distributed at random across the q

strata. Suppose that there are d response variables, which can be jointly cross-classified into

\[
r = \prod_{k=1}^{d} L_k
\]

response profiles, where the k-th response has $L_k$ levels. Thus, the data can still be considered as a q: s x r array, but the frequencies now represent the joint distribution of d response variables for s subpopulations. The randomization hypothesis of no association can be expressed as follows:

Ho: For each stratum h=1,2,...,q, the response variables are jointly distributed at random across the subpopulations.

This hypothesis can be investigated with multivariate extensions of the randomization statistics discussed in section 2.3.

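Let $\mathbf{A}_h = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_d]'$ denote a matrix of score vectors for the h-th stratum, where $\mathbf{a}_k$ (k=1,2,...,d) represents a vector of scalings of the r response profiles which construct summary measures for the d response variables. For example, if two dichotomous response variables were cross-classified to form four response profiles, then appropriate binary scores might be $\mathbf{a}_1' = (0,0,1,1)$ and $\mathbf{a}_2' = (0,1,0,1)$. A function vector which represents the compounded mean scores is written

\[
\mathbf{F}_h = \frac{1}{y_{h**}} \left[ \mathbf{A}_h \otimes \mathbf{D}_{\mathbf{p}_{h*\cdot}}^{-1} \right] \mathbf{y}_h
\tag{2.4.1}
\]

and has expected value

\[
E\{\mathbf{F}_h \mid H_0\} = \left[ \mathbf{A}_h\, \mathbf{p}_{h\cdot *} \right] \otimes \mathbf{1}_s
\tag{2.4.2}
\]

and covariance matrix

\[
\mathrm{Var}\{\mathbf{F}_h \mid H_0\} = \frac{1}{y_{h**} - 1}
\left[ \mathbf{A}_h (\mathbf{D}_{\mathbf{p}_{h\cdot *}} - \mathbf{p}_{h\cdot *}\mathbf{p}_{h\cdot *}') \mathbf{A}_h' \right] \otimes
\left[ \mathbf{D}_{\mathbf{p}_{h*\cdot}}^{-1} - \mathbf{1}_s \mathbf{1}_s' \right] = \mathbf{V}_{F_h}
\tag{2.4.3}
\]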

When the within stratum sample sizes are sufficiently large (i.e., $y_{hi*} \geq 30$) the $\mathbf{F}_h$ have approximate multivariate normal distributions via randomization central limit theory arguments as previously discussed. One can then construct quadratic forms in linear functions among these elements to test Ho. Let $\mathbf{C}$ denote a (c x s) full rank matrix of among subpopulation contrasts. Then a test statistic for the comparison of the summary measures constructed from $\mathbf{A}_h$ in the h-th stratum can be written

\[
Q_{R,h} = \mathbf{F}_h' [\mathbf{I}_d \otimes \mathbf{C}]' \left\{ [\mathbf{I}_d \otimes \mathbf{C}]\, \mathbf{V}_{F_h}\, [\mathbf{I}_d \otimes \mathbf{C}]' \right\}^{-1} [\mathbf{I}_d \otimes \mathbf{C}]\, \mathbf{F}_h
\tag{2.4.4}
\]

$Q_{R,h}$ is approximately chi-square with dc d.f. and can be thought of as a multivariate mean score statistic. For the case where $\mathbf{C} = [\mathbf{I}_{(s-1)}, -\mathbf{1}_{(s-1)}]$, the test statistic $Q_{R,h}$ can be shown to have a one-way analysis of variance form. If rank scores are used in the score matrix, given that the dependent variables are ordinally scaled, then $Q_{R,h}$ is the multivariate Kruskal-Wallis test (Puri and Sen 1971).

A total partial association statistic can be computed when all the $y_{hi*}$ are large. The sum

\[
\sum_{h=1}^{q} Q_{R,h}
\]

is approximately chi-square with d.f.=qdc when this is true.

However, it may be more appropriate to investigate Ho in terms of an average partial association statistic. Such a statistic would judge the extent to which consistent patterns across the strata exist for the summary measures constructed from the response variables. It has the form:

\[
Q_{AR} = [\mathbf{F} - E(\mathbf{F})]'\, \mathbf{V}_F^{-1}\, [\mathbf{F} - E(\mathbf{F})]
\tag{2.4.5}
\]

where

\[
\mathbf{F} - E(\mathbf{F}) = \sum_{h=1}^{q} (\mathbf{A}_h \otimes \mathbf{C})(\mathbf{y}_h - \mathbf{m}_h)
\tag{2.4.6}
\]

and $\mathbf{V}_F$ is the corresponding covariance matrix. $Q_{AR}$ has an approximate chi-square distribution with d.f.=dc.
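A stratum-level multivariate mean score statistic can be sketched along the same lines. The Python code below, assuming NumPy, takes one s x r table of response-profile counts and a d x r score matrix, forms the vector of mean scores (summary measures varying slowest, subpopulations fastest), and evaluates a quadratic form in contrasts of the first (s-1) subpopulations against the last. The function name `multivariate_mean_score_q`, the covariance layout as coded, and the example counts are assumptions made for illustration; with a single score vector the result should reduce to the univariate analysis of variance form.

```python
import numpy as np

def multivariate_mean_score_q(table, score_matrix):
    """Return (Q, df) for one stratum with d summary measures."""
    y = np.asarray(table, dtype=float)                       # s x r counts
    a = np.atleast_2d(np.asarray(score_matrix, dtype=float))  # d x r scores
    s, d = y.shape[0], a.shape[0]
    n = y.sum()
    row, col = y.sum(axis=1), y.sum(axis=0)
    p_row, p_col = row / n, col / n

    means = (y @ a.T) / row[:, None]     # s x d mean scores
    f = means.T.ravel()                  # measure index slowest, subpopulation fastest

    # Randomization covariance of f under H0.
    cov_scores = a @ (np.diag(p_col) - np.outer(p_col, p_col)) @ a.T
    sub_part = np.diag(1 / p_row) - np.ones((s, s))
    v_f = np.kron(cov_scores, sub_part) / (n - 1)

    # Contrast each measure's first (s-1) subpopulation means with the last.
    c = np.hstack([np.eye(s - 1), -np.ones((s - 1, 1))])
    big_c = np.kron(np.eye(d), c)
    lf = big_c @ f
    q = float(lf @ np.linalg.solve(big_c @ v_f @ big_c.T, lf))
    return q, d * (s - 1)

# Hypothetical counts: 3 subpopulations by 4 profiles of two dichotomous responses.
table = [[10, 5, 8, 7], [6, 9, 4, 11], [3, 7, 9, 12]]
binary_scores = [[0, 0, 1, 1], [0, 1, 0, 1]]
print(multivariate_mean_score_q(table, binary_scores))
```

Summing such contrasts and covariance matrices over strata before forming the quadratic form would give the corresponding average partial association statistic.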

2.5 Summary

There are a number of randomization test statistics

which can be applied to evaluate the hypothesis Ho as stated

in section 2.3.3. The Cochran-Mantel-Haenszel approach to

the analysis of q: s x r contingency tables has been

presented in generality which includes matrix representation

and the analysis of multivariate response vectors. Average

partial association methods are emphasized since they


require less stringent sample sizes and do not require

homogeneity of subpopulation x response association, at

least not explicitly. It should again be noted, however,

that the alternative hypothesis of a consistent pattern of

variation implied by the use of average partial association

methodology is a narrow one. One needs to continually be

aware of this during its application. These methods have

another limitation in that the resulting inferences may only

apply to the actual subjects under study rather than to some

broader target population. For these reasons, such methods

may need to be supplemented by other statistical methods if

data description is an analysis objective.

CHAPTER III

WEIGHTED LEAST SQUARES METHODOLOGY

3.1 Introduction

Randomization methods often need to be supplemented by

other procedures in an overall statistical analysis if there

is interest in describing the variation among a set of

estimates produced from the data in the context of a

statistical model. Hypotheses investigating the

significance of the various sources of variation can thus be

couched in terms of linear hypotheses concerning the model

parameters. With categorical data, weighted least squares

methods are often an appropriate means of fitting linear

models since the homogeneous variance assumptions of the

usual least squares strategies are not required. The

estimates being modeled might be those resulting from a

cross-classification of variables which were chosen by a

variable selection scheme (see Chapter IV). For example, in

the CHESS dataset, the estimates under study might be the

proportion of subjects who had asthma or the average number

of colds for a given year for an area x sex x pollution

index cross-classification. Section 3.2 reviews the

weighted least squares methodology which would be

appropriate for the analysis of such an example.

Since the dataset is longitudinal in nature,


consisting of a possible nine measurements over time for the

cold and asthma indication measures, repeated measurement

strategies are potentially advantageous in its statistical

analysis. The questions of whether time is an important

source of variation and whether it interacts with other

explanatory variables, as well as the nature of the pattern

of variation, can be addressed within a linear model

framework which takes into account the repeated measures

structure of the data. Section 3.3 gives an overview of the

application of linear models to repeated measurements data,

including some traditional approaches for continuous data.

In addition, those repeated measurement strategies which are

appropriate for categorical data are described.

A key feature of the CHESS dataset is the presence of

incomplete data. While there are 4002 observations which

are complete and can be analysed by themselves, there is a

lot of additional information in the remaining 16,343

observations which have three or six legitimate data points

but are also missing six or three measurements. This is

typical of most longitudinal studies, especially as the

number of subjects and/or measurement times grow large.

Section 3.5 discusses the strategies which can be employed

to deal with incomplete data. Specifically, it discusses

ratio estimation and supplemental margins, missing data

techniques which can be applied to categorical data.

3.2 Weighted Least Squares Methodology

3.2.1 Overview

Weighted least squares functional modeling is a

general strategy which allows one to describe the variation

among subpopulation summary measures which are functions of

within-subpopulation response distributions. The possible

choices for summary measures include a wide variety of

functions such as proportions, means, ratios, and kappa

statistics. The required elements of a weighted least

squares functional analysis (WLSA) are the specification of

the summary measures as a (u x 1) vector F, and a non-

singular, consistent estimator VF of the covariance matrix

of F. The function estimates are asymptotically normal, and

their asymptotic covariance matrix is model-free. Weighted

least squares can be applied to estimate the parameters of a

linear model

(3.2.1)    $E_A\{F\} = \mu = X\beta$

where X is a (u x t) design matrix of full rank $t \le u$ and $\beta$

is a (t x 1) vector of unknown parameters. $E_A\{\ \}$ represents

asymptotic expected value as the size of the sample on which

the summary statistic is based increases appropriately.

Another manner in which linear modeling can proceed is via

the construction and testing of hypotheses of the form

(3.2.2)    $H_0: W\mu = 0$

where $\mu$ is a (u x 1) vector of expected values of the

functions under study, and W is a known ((u-t) x u) matrix

of constraints. Both model specifications are linked, as

will be discussed. Wald statistics are employed to assess

the appropriateness of both (3.2.1) and (3.2.2), and it can

be shown that the Wald goodness-of-fit for (3.2.1) is

identical to the minimized weighted sum of squares of the

residuals resulting from the observed values of the summary

measures and their model-predicted values. In addition,

when the data being analyzed are distributed as Poisson or

product multinomial, both the weighted least squares results

and the Wald statistic are equivalent to that which would be

obtained from minimum modified chi-square methods. The only

data situation discussed in this section will be product

multinomial.

The use of WLSA does require a probability model

linking the observed data to some sort of extended

population, in contrast to the randomization strategies of

Chapter II, the distribution assumptions for which were

conditional on aspects of the study design for the observed

data themselves. Use of the product multinomial

distribution is appropriate for data arising from stratified

simple random sampling, and potentially satisfactory for

sampling schemes which are arguably equivalent.

The discussion of the weighted least squares

methodology which follows is based on the presentation of

Grizzle, Starmer, and Koch (1969) as well as Koch et al.

(1985). Computational details are also discussed in Landis,

Stanish, Freeman, and Koch (1976), which is the description

of a Fortran computer program named GENCAT which implements

most of these strategies.

3.2.2 Statistical Theory for Weighted Least Squares

Assume that there exist s subpopulations indexed by

i=1,2,...,s from which independent samples of size $n_{i*}$ have

been selected. Let j=1,2,...,r index a set of response

profiles determined by one or more response variables which

uniquely classifies all of the subjects, and let $y_{ij}$ denote

the frequency of the j-th response profile for the subjects

in the i-th subpopulation. Table 3.1 summarizes the data in

the form of a s x r contingency table.

Table 3.1

General Contingency Table for the Frequencies of Response
Profiles for a Set of Subpopulations

                 Response           Response Profile
Subpopulation    Vector       1      2     ...     r       Total

      1           y1'        y11    y12    ...    y1r      n1*
      2           y2'        y21    y22    ...    y2r      n2*
      .            .          .      .             .        .
      s           ys'        ys1    ys2    ...    ysr      ns*

    Total                    y*1    y*2    ...    y*r      n**

Let $y_i'$ represent the vector of responses

$y_i' = (y_{i1}, y_{i2}, \ldots, y_{ir})$ for the i-th subpopulation. If

$y = (y_1', y_2', \ldots, y_s')'$ denotes the sr x 1 vector of all the

individual frequencies, then y follows the product

multinomial distribution, given the assumed sampling

framework. The likelihood function is expressed as

(3.2.3)    $\Pr\{y\} = \prod_{i=1}^{s} \left\{ n_{i*}! \prod_{j=1}^{r} \frac{\pi_{ij}^{y_{ij}}}{y_{ij}!} \right\}$

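where $\pi_{ij}$ is the probability that a randomly selected

subject from the i-th subpopulation has the j-th response

profile. The $\{\pi_{ij}\}$ satisfy the natural restrictions:

(3.2.4)    $\sum_{j=1}^{r} \pi_{ij} = 1$    for i=1,2,...,s.

Unbiased estimators for $\{\pi_{ij}\}$ are the sample proportions

$p_{ij} = y_{ij}/n_{i*}$. The covariance matrix of the $\{p_{ij}\}$ includes

the following components:

(3.2.5)    $\mathrm{Cov}\{p_{ij}, p_{i'j'}\} = \begin{cases} \pi_{ij}(1-\pi_{ij})/n_{i*} & \text{if } i=i',\ j=j' \\ -\pi_{ij}\pi_{ij'}/n_{i*} & \text{if } i=i',\ j \ne j' \\ 0 & \text{if } i \ne i' \end{cases}$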

Vector notation allows one to express this more succinctly.

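Let $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ir})'$ and $p_i = (p_{i1}, p_{i2}, \ldots, p_{ir})' =$

$(y_i/n_{i*})$ represent the parameter vector and its sample

estimate for the i-th subpopulation. Let $\pi = (\pi_1', \pi_2', \ldots, \pi_s')'$

denote the compound vector of parameters for all s

subpopulations and similarly, let $p = (p_1', p_2', \ldots, p_s')'$ denote

the sample proportion vector. It follows that

(3.2.6)    $E(p) = \pi, \quad \mathrm{Var}(p) = V(\pi) = \begin{bmatrix} V_1(\pi_1) & 0_{rr} & \cdots & 0_{rr} \\ 0_{rr} & V_2(\pi_2) & \cdots & 0_{rr} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{rr} & 0_{rr} & \cdots & V_s(\pi_s) \end{bmatrix}$

where

(3.2.7)    $V_i(\pi_i) = \frac{1}{n_{i*}} \left[ D_{\pi_i} - \pi_i \pi_i' \right]$

is the covariance matrix for the i-th subpopulation, with $D_{\pi_i}$

denoting the diagonal matrix with the elements of $\pi_i$ on its diagonal. A

consistent estimator for $V(\pi)$ is V(p), where p has been

substituted for $\pi$.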
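
As an illustration only, the block-diagonal estimate V(p) can be assembled directly from an s x r table of frequencies. The following Python sketch assumes NumPy and assumes that each block takes the usual multinomial form (the diagonal matrix of proportions minus the outer product of the proportion vector, divided by the subpopulation sample size). The function name `multinomial_cov` and the small 2 x 3 table are hypothetical and are not taken from the CHESS data.

```python
import numpy as np

def multinomial_cov(counts):
    """Return p (length s*r) and the block-diagonal estimate V(p).

    counts: (s x r) array of frequencies, one row per subpopulation.
    Block i is (diag(p_i) - p_i p_i') / n_i, the usual multinomial form
    (assumed here for illustration).
    """
    counts = np.asarray(counts, dtype=float)
    s, r = counts.shape
    n = counts.sum(axis=1)              # subpopulation sample sizes
    p = counts / n[:, None]             # row-wise sample proportions
    V = np.zeros((s * r, s * r))
    for i in range(s):
        block = (np.diag(p[i]) - np.outer(p[i], p[i])) / n[i]
        V[i * r:(i + 1) * r, i * r:(i + 1) * r] = block
    return p.ravel(), V

# Hypothetical 2 x 3 table, for illustration only.
y = np.array([[30, 50, 20],
              [10, 60, 30]])
p, V = multinomial_cov(y)
```

Each block of V is singular because the proportions within a subpopulation sum to one, which is why the functions analysed must be linearly independent of the natural restrictions.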

Functional linear models for the vector $\pi$ are

expressed as

(3.2.8)    $F(\pi) = X\beta$

where $F(\pi) = [F_1(\pi), F_2(\pi), \ldots, F_u(\pi)]'$ is a set of $u \le s(r-1)$

functions, X is a known (u x t) design (specification) matrix

with rank $t \le u$, and $\beta$ is a (t x 1) vector of unknown parameters.

Each of the functions is required to have continuous partial

derivatives through order two with respect to $\pi$ in an open

region containing $\pi = E(p)$. Another condition that F must

satisfy is that its covariance matrix be non-singular. This

covariance matrix can be written

(3.2.9)    $V_F(\pi) = [H(\pi)][V(\pi)][H(\pi)]'$

where $H(\pi) = [\partial F(z)/\partial z \mid z = \pi]$ is the first derivative matrix of

F(z). A sufficient condition for $V_F(\pi)$ to be non-singular is

$\mathrm{Rank}\{[H(\pi)]', [1_r \otimes I_s]\} = \mathrm{Rank}\{H(\pi)\} + s$.

This states that the functions $F(\pi)$ and the natural

restrictions are linearly independent in an open region

containing $\pi$. Finally, there must be sufficiently large

sample sizes $\{n_{i*}\}$ for the estimators F(p) to have an

approximate multivariate normal distribution.

Given that the above conditions are satisfied, the

model (3.2.8) is asymptotically equivalent to the linear

Taylor series

(3.2.10)    $W[F(p)] + W[H(p)][\pi - p] = 0_{(u-t)}$

for the constraints

(3.2.11)    $W[F(\pi)] = 0_{(u-t)}$

where W is a known [(u-t) x u] orthocomplement matrix to X

at the sample estimator p. By re-expressing (3.2.11) and

providing for the natural restrictions (3.2.4), a linear

structure for $\pi$ is obtained which provides a basis for

finding an estimator $\hat{\pi}$ which minimizes the Neyman chi-square

criterion

(3.2.12)    $Q_N = \sum_{i=1}^{s} \sum_{j=1}^{r} \frac{n_{i*}(p_{ij} - \hat{\pi}_{ij})^2}{p_{ij}}$

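Bhapkar (1966) demonstrated that

(3.2.13)    $Q_N = Q_W(F) = F'W'[W V_F W']^{-1} W F$

where $Q_W(F)$ is a Wald (1943) statistic for the model

corresponding to the linearized constraints (3.2.10), with F = F(p)

and $V_F = V_F(p)$. In

turn, a matrix identity lemma in Koch (1969) allows one to

show that $Q_W$ is identically equal to

(3.2.14)    $Q_W(b) = (F - Xb)' V_F^{-1} (F - Xb)$

where

(3.2.15)    $b = (X' V_F^{-1} X)^{-1} X' V_F^{-1} F$

is the weighted least squares estimator for $\beta$ in (3.2.1).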

$Q_N$, $Q_W(b)$, and $Q_W(F)$ all are approximately chi-square with

(u-t) degrees of freedom. Thus, b is the minimum modified

chi-square estimator, and as such is a member of the class

of best asymptotic normal (BAN) estimators as demonstrated by

Neyman (1949). These properties of b, and the ease with

which QN=QW can be calculated are reasons why the weighted

least squares approach to categorical data analysis was

promoted by Grizzle, Starmer, and Koch in their 1969 paper.

When the subpopulation sample sizes become

sufficiently large, b has a multivariate normal distribution

under the model (3.2.8):

(3.2.16)    $b \sim N\left(\beta,\ \{X'[V_F(\pi)]^{-1}X\}^{-1}\right)$

If the model (3.2.1) does adequately characterize the data,

as indicated by the goodness-of-fit criterion Qw, then

linear hypotheses of the form $C\beta = 0$, where C is a known (c x t)

matrix of constants of rank c, can be tested with the Wald

statistic

(3.2.17)    $Q_C = b'C'[C V_b C']^{-1} C b$

where $V_b = \{X' V_F^{-1} X\}^{-1}$ is a consistent estimator of Var{b}. $Q_C$

is approximately chi-square with d.f.=c under $H_0$. $Q_C$ is

identically equal to the difference between the Wald

statistic $Q_W$ for the model (3.2.8) and the goodness-of-fit

statistic $Q_{W,C}$ for the reduced model $X_C = XZ$ where Z is a

known (t x (t-c)) orthocomplement to C. This means that $Q_C$

is effectively a test statistic for the additional

constraints implied by $X_C$. Predicted values $\hat{F} = Xb$ are useful

to calculate in order to facilitate model interpretation;

$V_{\hat{F}} = X V_b X'$ is a consistent estimator for their covariance

matrix.
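
The relationships among these statistics can be checked numerically. The sketch below is a minimal illustration in Python with NumPy, not the GENCAT implementation. The function names `wls_fit` and `wald`, and the function estimates, covariance matrix, and design matrix, are hypothetical. It assumes the standard weighted least squares forms: b is obtained by weighting with the inverse of the estimated covariance matrix, and the goodness-of-fit statistic is the weighted residual sum of squares.

```python
import numpy as np

def wls_fit(F, VF, X):
    """Return (b, Vb, Q): WLS estimate, its covariance, goodness-of-fit statistic."""
    VF_inv = np.linalg.inv(VF)
    Vb = np.linalg.inv(X.T @ VF_inv @ X)
    b = Vb @ X.T @ VF_inv @ F
    resid = F - X @ b
    return b, Vb, float(resid @ VF_inv @ resid)

def wald(C, b, Vb):
    """Wald statistic b'C'(C Vb C')^{-1} C b for H0: C beta = 0."""
    Cb = C @ b
    return float(Cb @ np.linalg.solve(C @ Vb @ C.T, Cb))

# Hypothetical function estimates and covariance matrix (u = 4).
F = np.array([0.20, 0.30, 0.50, 0.55])
VF = np.diag([0.0010, 0.0012, 0.0015, 0.0020])
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])   # t = 2
W = np.array([[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]])     # W X = 0

b, Vb, Q_fit = wls_fit(F, VF, X)
Q_constraints = wald(W, F, VF)           # Wald statistic for W F = 0

C = np.array([[0.0, 1.0]])               # H0: second parameter is zero
Q_C = wald(C, b, Vb)
_, _, Q_reduced = wls_fit(F, VF, X[:, :1])   # reduced model X Z with Z = (1, 0)'
```

Here `Q_fit` and `Q_constraints` agree, illustrating that the residual form and the constraint form of the goodness-of-fit statistic coincide, and `Q_C` equals `Q_reduced - Q_fit`.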

In the original Grizzle, Starmer, and Koch paper

(1969), two types of functions were discussed. These

included those for the strictly linear model

(3.2.18)    $F(\pi) = A\pi$

and the log-linear model

(3.2.19)    $F(\pi) = A[\log \pi]$

In the first case, the estimated function is F(p)=Ap and the

covariance structure is $V_F = A V_p A'$. For the log-linear model,

$F(p) = A \log(p)$ and $V_F = A D_p^{-1} V_p D_p^{-1} A'$. While many applications

of WLSA are covered with such functions, other applications

may require more complicated functions. These can usually

be generated as a sequence of linear, logarithmic, and

exponential operations on the vector p. Consequently, $V_F(\pi)$

is estimated by $V_F = [H(p)] V_p [H(p)]'$ (see 3.2.9) where H(p) is

a product of the first derivative matrices $\{H_k(p)\}$ where k

indicates the k-th operation in accordance with the chain

rule. These matrices are relatively simple:

(i) for linear functions Az, H = A

(ii) for log functions log z, $H = D_z^{-1}$

(iii) for exponential functions exp z, $H = D_{\exp z}$
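
The chain rule construction can be sketched for one compounded function, a linear operation followed by a log and a second linear operation. The Python code below (NumPy assumed) is illustrative only: the function name `log_linear_function`, the matrices, and the single subpopulation with proportions (0.4, 0.6) from a sample of 100 are hypothetical. With these choices the function is a logit, whose large-sample variance is known to be 1/(n p (1-p)), so the result can be checked.

```python
import numpy as np

def log_linear_function(p, Vp, A1, A2):
    """F = A2 log(A1 p) with covariance H Vp H' built by the chain rule."""
    a1 = A1 @ p                           # linear step, derivative matrix A1
    F = A2 @ np.log(a1)                   # log step, then second linear step
    H = A2 @ np.diag(1.0 / a1) @ A1       # product A2 * D_{a1}^{-1} * A1
    return F, H @ Vp @ H.T

# Hypothetical single subpopulation: n = 100, proportions (0.4, 0.6).
n = 100.0
p = np.array([0.4, 0.6])
Vp = (np.diag(p) - np.outer(p, p)) / n
A1 = np.eye(2)
A2 = np.array([[1.0, -1.0]])              # log p1 - log p2, i.e. the logit

F, VF = log_linear_function(p, Vp, A1, A2)
```

Longer sequences of operations extend the product H in the same way, one derivative matrix per step, applied from the last operation to the first.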

Forthofer and Koch (1973) examined functions of the form

(3.2.20)

for which

(3.2.21 )

and

(3.2.22)

for which

(3.2.23)

Here, $a_1 = A_1 p$, $a_2 = \exp(A_2 \log a_1)$, and $a_3 = A_3 a_2$. Some functions

which take the form (3.2.20) or (3.2.22) include complex

ratio estimates such as rank correlation coefficients, kappa

statistics, and survival rates.

3.2.3 An Example of a Strictly Linear Model

One frequent application of weighted least squares is

for a strictly linear model of the form

(3.2.24)    $F(\pi) = A\pi = X\beta$

Such a model is appropriate when the $A\pi$ represent

subpopulation means or marginal probabilities. A is a known

(u x sr) matrix with full rank $u \le s(r-1)$ and its rows are

oblique to the natural constraints (3.2.4), i.e.

(3.2.25)    $\mathrm{Rank}\{A', [1_r \otimes I_s]\} = \mathrm{Rank}\{A\} + s$

For this model, the estimated function vector is F=Ap and

the corresponding estimated covariance matrix is $V_F = A V_p A'$.

Thus, the WLS estimators for $\beta$ can be written

(3.2.26)    $b = \{X'(A V_p A')^{-1} X\}^{-1} X'(A V_p A')^{-1} A p$

and the goodness-of-fit statistic $Q_W$ is expressible in terms

of X, A, and $V_p$ as

$Q_W = (Ap - Xb)'(A V_p A')^{-1}(Ap - Xb)$

The following example illustrates how WLSA is applied in a

strictly linear model situation.

Table 3.2 contains a subset of the CHESS data. Those

individuals in the California areas are cross-classified by

area, sex, and the number of colds which they reported in

1973.

Table 3.2

Responses for Those California CHESS Subjects Who Reported
the Presence or Absence of Cold Symptoms in 1973

Area             Sex       0 Colds   1 Cold   2 Colds   3 Colds

California I     Male        377      187       74        18
California I     Female      281      184       95        29
California II    Male        241      133       64        23
California II    Female      173      130       76        25

Total                       1072      634      309        95

Note that no problems are produced by this area x sex x

colds crosstabulation. With the smallest subpopulation

sample size being $n_{4*}=404$ (i=1,2,3,4), the application of

asymptotic theory to the analysis is justified.

In order to proceed with a linear model analysis of

the data, one needs to assume that the subjects in each of

the area x sex subpopulations are conceptually

representative of some corresponding larger subpopulation in

a manner consistent with stratified simple random sampling.

Then, the data of Table 3.2 can be described by the

following likelihood:

(3.2.26)    $\Pr\{y\} = \prod_{i=1}^{4} \left\{ n_{i*}! \prod_{j=1}^{4} \frac{\pi_{ij}^{y_{ij}}}{y_{ij}!} \right\}$

where $\pi_{ij}$ denotes the probability that a randomly selected

subject from the i-th area x sex subpopulation has the j-th

response (j=1,2,3,4). Yij represents the number of subjects

from the i-th subpopulation with the j-th response. Since

the responses are ordinally scaled, ranging in intensity

from low to high, mean scores are an appropriate function to

analyse. Consider the following response function:

(3.2.27)    $F(p) = Ap = \{[0\ 1\ 2\ 3] \otimes I_4\}\,p$

Pre-multiplying p by A creates a function vector containing

estimates for the mean number of colds for each

subpopulation. F(p) is approximately multivariate normal

and has the estimated covariance matrix $V_F = A V_p A'$.

Since there is no a priori model for these functions,

the cell mean model is first fit to gain information on the

significant sources of variation for the mean estimates.

This model can be expressed as

$E_A\{F(p)\} = A\pi = X\beta = I_4 \beta_I = \beta_I$

In this case, the estimates for $\beta_I$ are $b_I = F$. Since this is

a saturated model, there is no goodness-of-fit statistic

defined. Linear hypotheses are tested to assess the

sources of variation in $\beta_I$. The first hypothesis to be

investigated is whether there are differences among the 4

subpopulations, i.e.

$H_0: \beta_{I1} = \beta_{I2} = \beta_{I3} = \beta_{I4}$

The corresponding C matrix is

$C = \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{bmatrix}$

$Q_C = b'C'(C V_b C')^{-1} C b = 32.39$ (d.f.=3) for this hypothesis

test, and is clearly significant ($\alpha$=.01). This indicates the

need for additional investigation. The two California areas

are compared with their averages over sex via the contrast

matrix

C = [ 1 1 -1 -1 ]

and found to be significantly different as the resulting

$Q_C$ = 8.22 (d.f.=1). Similarly, the hypothesis of a sex

difference is tested with a C matrix investigating their

averages over area:

C = [ 1 -1 1 -1 ]

$Q_C$ = 20.96 (d.f.=1) and is also clearly significant. Sex x area

interaction is investigated with

C = [ 1 -1 -1 1 ]

This is also a test of the additivity of sex and area.

QC=.08 and is non-significant. This finding is equivalent to

finding the estimates compatible with the reduced model

$E_A\{F(p)\} = X_R \beta_R = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}$

$\beta_1$ represents the predicted value for California I males,

while $\beta_3$ represents an increment for females and $\beta_2$

represents an increment for California II. The estimated

parameters and their covariance matrix are

b = [.596 .113 .181]'

$V_b = \begin{bmatrix} .8108 & -.5435 & -.5995 \\ -.5435 & 1.500 & -.0391 \\ -.5995 & -.0391 & 1.433 \end{bmatrix} \times 10^{-3}$

The goodness-of-fit statistic for this model is identical to

the test statistic for the no interaction hypothesis since

that hypothesis is the implied constraint for the model.
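
The calculations in this example can be retraced from the frequencies in Table 3.2. The Python sketch below (NumPy assumed) is an illustration rather than the GENCAT program used for the analysis. It assumes the multinomial covariance estimate with divisor equal to the subpopulation sample size, and the helper name `wald` is hypothetical. Rows are ordered as California I male, California I female, California II male, California II female.

```python
import numpy as np

# Table 3.2 frequencies (columns: 0, 1, 2, 3 colds).
y = np.array([[377, 187, 74, 18],
              [281, 184, 95, 29],
              [241, 133, 64, 23],
              [173, 130, 76, 25]], dtype=float)
scores = np.array([0.0, 1.0, 2.0, 3.0])

n = y.sum(axis=1)
p = y / n[:, None]
F = p @ scores                               # mean number of colds per subpopulation
VF = np.diag((p @ scores**2 - F**2) / n)     # A Vp A' is diagonal for mean scores

def wald(C, b, Vb):
    """Wald statistic b'C'(C Vb C')^{-1} C b."""
    C = np.atleast_2d(np.asarray(C, dtype=float))
    Cb = C @ b
    return float(Cb @ np.linalg.solve(C @ Vb @ C.T, Cb))

# Saturated (cell mean) model: b = F and Vb = VF.
Q_total = wald([[1, 0, 0, -1], [0, 1, 0, -1], [0, 0, 1, -1]], F, VF)
Q_area = wald([1, 1, -1, -1], F, VF)
Q_sex = wald([1, -1, 1, -1], F, VF)
Q_inter = wald([1, -1, -1, 1], F, VF)

# Reduced main-effects model: intercept, area increment, sex increment.
XR = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
VF_inv = np.linalg.inv(VF)
Vb = np.linalg.inv(XR.T @ VF_inv @ XR)
b = Vb @ XR.T @ VF_inv @ F
resid = F - XR @ b
Q_fit = float(resid @ VF_inv @ resid)        # goodness of fit, d.f. = 1
```

Under these assumptions the sketch agrees, to rounding, with the statistics and parameter estimates quoted in this section, and `Q_fit` coincides with the interaction statistic `Q_inter`.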

3.2.4 Case Record Data

Often, a subject is classified according to d

categorical variables for which the k-th has the possible

outcomes j=1,2,...,$L_k$. Thus, there would be

$r = \prod_{k=1}^{d} L_k$

possible multivariate profiles. When d>3, this number can

become very large. Consequently, the (sr x 1) proportion

vector becomes large and may include many zero frequencies, and

the matrix operations required to produce a

covariance matrix become difficult to perform from a computational

point of view. These potential problems can be circumvented

by operating on the raw data for each subpopulation to form

the function of interest, which will generally be

subpopulation means. This is referred to as case record

analysis.

Specifically, let there exist s subpopulations from

which samples of size $\{n_{i*}\}$ have been selected. Let

l=1,2,...,$n_{i*}$ index the subjects from the i-th subpopulation.

Then, let $y_{il} = (y_{i1l}, y_{i2l}, \ldots, y_{idl})'$ represent the vector of

responses for the l-th subject. The sample mean for the k-th

response in the i-th subpopulation can be expressed as

(3.2.28)

where $n_i^{(k)} = (n_{i1}^{(k)}, \ldots, n_{iL_k}^{(k)})'$ denotes the vector of

frequencies $n_{ij}^{(k)}$ for the j-th outcome of the k-th response

in the sample from the i-th subpopulation. Let

$a_k = (a_{k1}, a_{k2}, \ldots, a_{kL_k})'$ represent a set of finite scores

corresponding to the outcomes of the k-th response. The

corresponding vector of proportions is written

$p_i^{(k)} = (n_i^{(k)}/n_{i*})$.

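It follows from the Central Limit Theorem that the

$\bar{y}_i = (\bar{y}_{i1}, \bar{y}_{i2}, \ldots, \bar{y}_{id})'$ have approximate multivariate normal

distributions when the $n_{i*}$ are sufficiently large ($n_{i*} \ge 20$)

with

(3.2.29)    $E\{\bar{y}_i\} = \mu_i, \quad \mathrm{Var}\{\bar{y}_i\} = \frac{1}{n_{i*}} \Sigma_i$

with $\mu_i$ being the vector of means for the i-th

subpopulation, and the covariance matrix $\Sigma_i$ estimated by

(3.2.30)    $V_i = \frac{1}{n_{i*}} \sum_{l=1}^{n_{i*}} (y_{il} - \bar{y}_i)(y_{il} - \bar{y}_i)'$.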

If $\bar{y} = (\bar{y}_1', \bar{y}_2', \ldots, \bar{y}_s')'$ denotes the vector of sample means

for all the responses for all subpopulations, and

$\mu = (\mu_1', \mu_2', \ldots, \mu_s')'$ denotes the corresponding expected value

vector, then a strictly linear model can be written as

(3.2.31)    $E\{\bar{y}\} = \mu = X\beta$

where X is a known (ds x t) design matrix with full rank $t \le ds$

and $\beta$ is a (t x 1) vector of unknown parameters. The

covariance matrix of $\bar{y}$ has the following consistent

estimator:

(3.2.32)    $V_{\bar{y}} = \begin{bmatrix} \frac{1}{n_{1*}} V_1 & \cdots & 0_{dd} \\ \vdots & \ddots & \vdots \\ 0_{dd} & \cdots & \frac{1}{n_{s*}} V_s \end{bmatrix}$

At this point, weighted least squares estimation can be

applied as in section 3.2.2 to generate an estimate b for $\beta$.

Similarly, the Wald statistic QW is appropriate to assess

the goodness-of-fit of the model (3.2.31), where W is an

[(sd-t) x sd] orthocomplement to X, and has an approximate

chi-square distribution with d.f.=(sd-t). For the same

types of arguments as applied in section 3.2.2, b has an

approximate multivariate normal distribution, so that linear

hypotheses of the form $H_0: C\beta = 0$ can be investigated with the

test statistic $Q_C$. $V_b = \{X' V_{\bar{y}}^{-1} X\}^{-1}$ is a consistent estimator

of Var{b}. $Q_C$ is distributed as approximately chi-square

with d.f.=c.

The weighted least squares methodology is also

appropriate for the more general functional linear model.

Let $F(\bar{y}) = [F_1(\bar{y}), F_2(\bar{y}), \ldots, F_u(\bar{y})]'$ be a set of $u \le ds$ functions

of interest, X be a known (u x t) design matrix of full rank

$t \le u$, and $\beta$ be an unknown (t x 1) vector of parameters. F( )

must satisfy the conditions of having continuous partial

derivatives through order two in an open region containing

$\mu$, and also $V_F$ must be nonsingular, where

$V_F = \left[ \frac{\partial F}{\partial z} \Big|_{z=\bar{y}} \right] V_{\bar{y}} \left[ \frac{\partial F}{\partial z} \Big|_{z=\bar{y}} \right]'$

Thus, functional linear models which apply to mean

vectors $\{\bar{y}_i\}$ are conceptually analogous to those which apply

to multinomially distributed counts $\{y_{ij}\}$. There is an

advantage, however, in that the sample size requirements are

less stringent since they pertain to the $\{n_{i*}\}$ instead of

the individual $\{y_{ij}\}$. This is due to the fact that one is

supporting multivariate normality for a mean vector as

opposed to a compound vector of proportions. Also, working

with case record data to calculate $\bar{y}$ and $V_{\bar{y}}$ is often more

computationally feasible than working with the corresponding

contingency table.
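
A minimal sketch of the case record computation follows, in Python with NumPy. The function name `case_record_summary` is hypothetical, and the divisor n (rather than n - 1) in the covariance estimate is an assumption made so that the result matches the contingency table calculation. The records are generated by expanding the California I male row of Table 3.2 (d = 1 response, scores 0 to 3) into one record per subject.

```python
import numpy as np

def case_record_summary(Y):
    """Mean vector and estimated covariance of the mean from raw records.

    Y: (n x d) array, one row per subject, one column per response.
    Assumption: the within-subpopulation covariance uses divisor n.
    """
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    ybar = Y.mean(axis=0)
    dev = Y - ybar
    V = dev.T @ dev / n          # estimate of the within-subpopulation covariance
    return ybar, V / n           # covariance of the mean vector

# California I males from Table 3.2, expanded to one record per subject.
counts = np.array([377, 187, 74, 18])
scores = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.repeat(scores, counts).reshape(-1, 1)
ybar, V_ybar = case_record_summary(Y)

# The same quantities obtained from the proportions of the contingency table.
p = counts / counts.sum()
mean_from_table = scores @ p
var_from_table = (scores**2 @ p - mean_from_table**2) / counts.sum()
```

With this divisor the two routes give identical means and variances, so the case record approach changes only how the estimates are computed, not their values.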

3.3 Overview of Repeated Measurement Analyses

A repeated measurements study can be broadly defined

as one in which each of the units investigated is measured

under two or more different conditions. Some types of

studies encompassed by repeated measurement designs are

split-plot experiments and change-over design studies.

Another class which includes the CHESS study is longitudinal

studies; for these, the different conditions represent time.

These are especially appropriate when the outcome of

interest may exhibit trend; consequently, many medical and

epidemiological investigations are longitudinal in nature.

While some investigations involve the use of just one

of the multiple responses recorded for a given subject (e.g.

endpoint analysis, see Gould 1980), most take advantage of

the multivariate nature of the data. The type of analysis

which is appropriate for a repeated measurements study

depends on a number of factors. These include whether the

point of the statistical analysis is to address specific

hypotheses or to generate a descriptive linear model. The

structure of the pertinent covariance matrix, the

distributional assumptions which can be made for the

variables under study, and the sample sizes are involved.

Potential strategies include profile analysis, univariate

analysis of variance for a summary measure of the response

variables, MANOVA for two or more summary measures, repeated

measurements analysis of variance when the respective

covariance matrix is compound symmetric, and generalized

linear modeling for serial measurements. Koch, Amara,

Stokes, and Gillings (1980) contains an overview of the

customary approaches to repeated measurements data and the

situations in which they apply. The following discussion is

directed at some of these commonly used strategies for

repeated measurements in the continuous data setting.

Table 3.3.1 displays a general data structure for

repeated measurements data.

Table 3.3.1

Data Structure for Repeated Measurements

           Subject in
Group      Group          Responses Within Group

  1          1            y111     y121    ...   y1d1
  1          2            y112     y122    ...   y1d2
  .          .             .        .             .
  1          n1           y11n1    y12n1   ...   y1dn1
  .          .             .        .             .
  s          ns           ys1ns    ys2ns   ...   ysdns

Let $y_{il} = (y_{i1l}, y_{i2l}, \ldots, y_{idl})'$ represent the vector of

responses for the l-th subject in the i-th group where

i=1,2,...,s and l=1,2,...,$n_i$. $n_i$ is the total number of

subjects in the i-th group. If interest lies in comparing

the response profiles among groups, when the measurement is

interval scaled, then standard multivariate analysis of

variance techniques can be applied. Such comparisons might

involve testing for group and/or condition differences, or

for (group x condition) interactions. The usual general

linear model is written

(3.3.1)    $E\{Y\} = X\beta$

where $Y = (y_{11}, y_{12}, \ldots, y_{1n_1}, y_{21}, \ldots, y_{sn_s})'$ is the (n x d)

observation matrix, $y_{il} = (y_{i1l}, y_{i2l}, \ldots, y_{idl})'$ is the vector

of d observations for the l-th subject in the i-th group,

and n is the total number of subjects. X is an n x q

specification matrix for the respective groups, and $\beta$ is a

(q x d) matrix of unknown parameters. The $y_{il}$ are

distributed as independent, multivariate normal with

expected value $\mu_i$ and covariance matrix $\Sigma$, where $\Sigma$ is

positive definite. It is assumed that there is no missing

data.

Under these specifications, the elements of $\beta$

represent response variable means for each of the s groups.

The least squares estimator of $\beta$ is

(3.3.2)    $b = (X'X)^{-1}X'Y$

The questions of interest can be addressed through linear

hypotheses concerning the elements of $\beta$ of the form

(3.3.3)    $C\beta A = 0$

where C is a (c x q) matrix of constants and A is a (d x s)

matrix of constants. Using the between groups sums of

crossproducts matrix

$Q_H = (CbA)'[C(X'X)^{-1}C']^{-1}(CbA)$

and the within groups sum of crossproducts matrix

$Q_E = A'Y'[I - X(X'X)^{-1}X']YA$

one can construct appropriate test statistics from $Q_H Q_E^{-1}$

such as Wilks' lambda criterion, Roy's largest root, or the

Lawley-Hotelling trace criterion. See Timm (1975), Timm

(1980), or Morrison (1967) for further details concerning

such tests.
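
As a rough illustration of how such a criterion is computed, the Python sketch below (NumPy assumed) evaluates Wilks' lambda as det(Q_E) / det(Q_E + Q_H). The hypothesis and error crossproducts matrices are written in their standard textbook forms, which is an assumption here; the function name `wilks_lambda`, the simulated data, and the particular C and A matrices are hypothetical.

```python
import numpy as np

def wilks_lambda(Y, X, C, A):
    """Wilks' lambda for H0: C B A = 0 in the model E{Y} = X B."""
    XtX_inv = np.linalg.inv(X.T @ X)
    B = XtX_inv @ X.T @ Y                         # least squares estimate
    CBA = C @ B @ A
    QH = CBA.T @ np.linalg.solve(C @ XtX_inv @ C.T, CBA)   # hypothesis crossproducts
    resid = Y - X @ B
    QE = A.T @ resid.T @ resid @ A                # error crossproducts
    return float(np.linalg.det(QE) / np.linalg.det(QE + QH))

# Simulated data: s = 2 groups, d = 3 conditions, 10 subjects per group.
rng = np.random.default_rng(0)
n_per_group, d = 10, 3
Y1 = rng.normal(size=(n_per_group, d))
Y2 = rng.normal(loc=1.0, size=(n_per_group, d))
Y = np.vstack([Y1, Y2])
X = np.kron(np.eye(2), np.ones((n_per_group, 1)))        # cell-mean design
C = np.array([[1.0, -1.0]])                               # compares the two groups
A = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0]])      # successive condition differences

lam_interaction = wilks_lambda(Y, X, C, A)
lam_same = wilks_lambda(np.vstack([Y1, Y1]), X, C, A)     # identical groups
```

Values of lambda near 1 indicate little evidence against the hypothesis; when the two groups are identical the hypothesis matrix vanishes and lambda equals 1.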

The question of whether there is a group by condition

interaction is investigated with (3.3.3) with the matrices

(3.3.4)

Keeping C as expressed in (3.3.4) and writing $A = I_d$ would

result in testing whether there are differences among the

groups; similarly, keeping A as expressed above and writing

$C = I_s$ would result in the test for differences among the

conditions. If the hypothesis of no group x condition

interaction is tenable, then the hypothesis of no group

differences is equivalent to the hypothesis of equal group

means. This can be accomplished by writing A as $A = 1_d$ and

keeping C as in (3.3.4). If Y were postmultiplied by A, one

would be entertaining a univariate analysis of variance on

group means. Similarly, writing $C = [n_1, n_2, \ldots, n_s]$ and

keeping A as in (3.3.4) would yield a test of equal

condition means, given that parallelism was again a valid

assumption.

Another possible choice for A in expression (3.3.4) is

for it to consist of an orthonormal set of all contrasts.

If it can be assumed that the covariance matrix of A'y can

be written

Var{A'y} = A'VA = Iv

a property known as sphericity, then univariate results can

be obtained. A model incorporating sphericity as an

assumption is the traditional mixed model analysis of

variance. This assumption is equivalent to the assumption

of homogeneous variances and covariances, which is often

reasonable in split-plot experiments where conditions are

randomized within subjects and the observational units are

usually interchangeable. Also, such experiments often

include limited numbers of observations per group such that

multivariate methods are ineffective. Thus, treating the

situation as univariate becomes essential in developing

analyses.

The appropriate model is written

(3.3.5)    $y_{ijl} = \mu_{ij} + s_{i*l} + e_{ijl}$

with the $\{s_{i*l}\}$ representing independent subject effects,

$\{\mu_{ij}\}$ representing fixed effects for the i-th group and j-th

response, and $\{e_{ijl}\}$ denoting independent response errors.

An appropriate F statistic with which to test the hypothesis

of no group x condition interaction can be written $F^*$ =

MSGC/MSE (where MSGC indicates the mean square due to group

x condition and MSE mean square error) with a rejection

region of $F^* > F_{1-\alpha}[(d-1)(s-1), (d-1)(n-s)]$. See Timm

(1975), Morrison (1976) and Koch, Elashoff and Amara (1985)

for details concerning repeated measurements mixed model

ANOVA tables. Also, several authors have discussed

correction factors which can be applied to generate

approximate F-tests when sphericity is not satisfied. See

Greenhouse and Geisser (1959), and Huynh and Feldt (1976).

Sphericity is usually an unreasonable assumption in

longitudinal studies. Often, those subject measures closely

spaced will be more highly correlated than those further

apart. However, due to missing data or the need to account

for condition-varying covariates, multivariate analysis of

variance methods may also be unsuitable. A strategy gaining

in use is to consider the problem a univariate regression

situation in which the responses are correlated. The data

situation is that of having n observation vectors $y_i$

(i=1,2,...,n), each with $d_i$ measurements reflecting $d_i$

conditions. One assumes that a linear model can be written

(3.3.6)    $E\{y_i\} = X_i \beta$

where $X_i$ is a design matrix for the i-th individual

reflecting functions of time and covariates. The $y_i$ are

considered to be distributed as multivariate normal.

However, the covariance structure

$\mathrm{Var}\{y_i\} = \Sigma_i$

is not presumed known, as model fitting consists of

estimating both $\beta$ and $\Sigma_i$. Two types of models considered

for this problem are random effects models and

autoregressive models. Ware (1985), Laird and Ware (1982),

and Cook and Ware (1983) contain discussions of such linear

models for longitudinal data.

3.4 Repeated Measurements Analysis for Categorical Data

Linear modeling of repeated measurements data when the

data is categorical in nature lends itself quite readily to

the weighted least squares methodology discussed in section

3.2.2. While decisions concerning the manner in which to

deal with correlated within-subject responses are a major

facet of continuous case analyses, resulting in strategies

such as modeling patterned covariance matrices, WLSA was

specifically intended for modeling correlated functions.

Attention does need to be paid to the selection of

appropriate functions to analyse, since adding response

variables due to time or conditions to any categorical data

setting increases the chances of zero frequencies in many of

the contingency table cells produced by cross-classifying

all subpopulation levels and response variable levels.

The data structure involved can be summarized as

follows. Let the response outcome consist of L possible

categories, and have the response be measured over d

conditions. Thus, there will be $r = L^d$ possible multivariate

response profiles. In Table 3.1, each entry $y_{ij}$ will

represent the frequency of a particular response profile for

the i-th group. These frequencies are better denoted as

$y_{i\mathbf{j}}$

where $\mathbf{j}$ is a vector subscript $\mathbf{j} = (j_1, j_2, \ldots, j_d)$, $j_g = 1,2,\ldots,L$ for

g=1,2,...,d. Thus, $\mathbf{j}$ will have as its values a particular

profile. Let $\pi_{i\mathbf{j}}$ denote the joint probability of response

profile $\mathbf{j}$ for subjects from the i-th subpopulation. A

function which allows one to address the usual questions of

interest in a repeated measurements analysis (i.e. group and/or

condition differences, group x condition interaction) is

the first order marginal probabilities. The following

discussion of the analysis of these and other functions is

taken from Koch et al. (1977), and Landis and Koch (1979).

The first order marginal probability is written

$\phi_{igk} = \sum_{\mathbf{j}:\, j_g = k} \pi_{i\mathbf{j}}$    for i=1,2,...,s; g=1,2,...,d; k=1,2,...,L

which represents the probability of the k-th response

category for the g-th condition in the i-th subpopulation.
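
The marginal probabilities are linear functions of the joint profile probabilities, so they can be produced by a matrix of zeros and ones. The Python sketch below (NumPy assumed) builds such a matrix for a single subpopulation. The function name `marginal_matrix`, the lexicographic ordering of profiles, and the probabilities for L = 2 categories over d = 3 conditions are all hypothetical choices made for illustration.

```python
import numpy as np
from itertools import product

def marginal_matrix(L, d):
    """Matrix mapping the L**d profile probabilities to the d*L marginals.

    Profiles are ordered lexicographically with the last condition varying
    fastest (an assumption). Row g*L + k selects the profiles whose g-th
    response falls in category k.
    """
    profiles = list(product(range(L), repeat=d))
    A = np.zeros((d * L, len(profiles)))
    for col, prof in enumerate(profiles):
        for g, k in enumerate(prof):
            A[g * L + k, col] = 1.0
    return A

# Hypothetical joint probabilities for profiles 000, 001, 010, ..., 111.
pi = np.array([0.30, 0.10, 0.05, 0.05, 0.10, 0.05, 0.05, 0.30])
A = marginal_matrix(2, 3)
marginals = A @ pi        # ordered by condition, then category
```

Applying the same matrix to the sample proportions of each subpopulation gives the estimated marginals, and the strictly linear model machinery of section 3.2.3 then applies to them.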

The following hypotheses can be investigated in terms of the

marginal probabilities:

H1: There are no differences among the marginal
    distributions of the respective attributes
    at each time point for the s subpopulations

H2: There are no differences among the marginal
    distributions of the respective attributes
    over the d time points within each of the
    subpopulations

H3: There is no group x condition interaction with
    respect to the marginal distributions

If H1 is true, then the $\{\phi_{igk}\}$ satisfy the following formal

hypothesis:

$H_{SM}$: $\phi_{1gk} = \phi_{2gk} = \ldots = \phi_{sgk}$ for g=1,2,...,d and k=1,2,...,L

If there are no differences among the conditions (H2), then

the following hypothesis of first order marginal symmetry

(homogeneity) must hold:

$H_{CM}$: $\phi_{i1k} = \phi_{i2k} = \ldots = \phi_{idk}$ for i=1,2,...,s and k=1,2,...,L

If H3 holds, then the $\{\phi_{igk}\}$ can be described by an additive

model

$H_{AM}$: $\phi_{igk} = \mu_k + \xi_{i*k} + \tau_{*gk}$

Here, i=1,2,...,s; g=1,2,...,d, and k=1,2,...,L. $\mu_k$ is an

overall mean associated with the k-th response category,

$\xi_{i*k}$ is an effect due to the i-th subpopulation, and $\tau_{*gk}$ is

an effect due to the g-th condition. It is assumed that the

usual ANOVA constraints are satisfied by the $\{\mu_k\}$, $\{\xi_{i*k}\}$,

and $\{\tau_{*gk}\}$; also the $\phi_{igk}$ sum to 1 over k for each i,g.

If the response categories are ordinally scaled, then

mean scores may be an appropriate function to analyse. Let

$\eta_{ig}$ be a mean score for the responses for the g-th

condition in the i-th subpopulation:

$\eta_{ig} = \sum_{k=1}^{L} a_k \phi_{igk}$    for i=1,2,...,s; g=1,2,...,d

The $\{a_k\}$ represent appropriate scalings. The hypothesis

$H_{SAM}$: $\eta_{1g} = \eta_{2g} = \ldots = \eta_{sg}$ for g=1,2,...,d

is satisfied when there are no differences among the

subpopulations, and the hypothesis

$H_{CAM}$: $\eta_{i1} = \eta_{i2} = \ldots = \eta_{id}$ for i=1,2,...,s

is satisfied when there are no differences among the

conditions. $H_{SAM}$ is implied by $H_{SM}$, and similarly, $H_{CAM}$ is

implied by $H_{CM}$. Additionally, a model similar to $H_{AM}$ can be

specified for the hypothesis of no group x condition

interaction.

In situations in which sample sizes are moderate at

best, the mean score may be the most appropriate function to

analyse. When attention is directed at the first order

marginal probabilities $\{t_{igk}\}$, then sample size requirements

are that most of the marginal frequencies be greater than 5,

and the subpopulation sizes be greater than 20. Situations

where L is larger often involve ordinally scaled data so

that mean score functions are justifiable.

While both estimates of marginal probabilities and

mean scores can be generated from the application of A

matrices to a vector of proportions p produced from a

contingency table, repeated measurements data frequently

lends itself to the case record strategy described in 3.2.4,

especially for generating mean scores. For the repeated

measurements situation, the components of the function

vector Vi would represent mean scores for each of the k

93

weighted least squares analysis, utilizing Wald statistics

to investigate hypotheses of interest, is essentially the

same type of regression analysis discussed in section 3.2.2.

3.5 Missing Data Strategies for Categorical Data Analysis

For most longitudinal data studies, there are subjects

with incomplete data. The easiest way to deal with missing

data would be to delete those observations which include

missing values (list-wise deletion), but this often leads to

the loss of a great deal of information if the number of

observations with incomplete data is large. Also, this may

lead to biased results. Timm (1970) and Gleason and Staelin

(1975) discuss strategies for missing data in the general

multivariate linear model setting in which attention is

focused on estimating the covariance matrix. Some

techniques discussed involve estimating the missing data

from regression estimates or principal components estimates

from the complete data. Kleinbaum (1970) proposed

estimating the covariance matrix from pair-wise complete

data. Still others have suggested the derivation of maximum

likelihood estimates on the assumption that the data are

distributed as multivariate normal. Laird and Ware (1982)

discuss the use of the EM algorithm to derive ML estimates

in a longitudinal data setting under the assumption of

random-effects models.

Sections 3.5.1 and 3.5.2 discuss two missing data strategies which are appropriate in a categorical data setting. Section 3.5.1 discusses the use of a ratio estimation procedure which is applicable to either continuous or categorical data. Section 3.5.2 is concerned

with the strategy of supplemental margins in which those

subjects with incomplete data are treated as a subset of the

entire dataset and used in the estimation of parameters for

which they contain pertinent information.

3.5.1 Ratio Estimation

A missing data strategy which can be applied to either

continuous or categorical data and involves neither the

estimation of missing values nor distributional assumptions

is that of ratio estimation, which is discussed in Stanish,

Gillings, and Koch (1978). This procedure involves the use

of multivariate ratio estimates incorporating the use of

indicator variables which denote the presence or absence of

data for a particular response variable. The basis for the

method is in a paper by Cornfield (1944). This strategy is

applied in a case record analysis setting. The procedure is

asymptotic, and thus requires sufficient sample sizes for

each of the subpopulations under investigation in order that

the ratio estimators be approximately normal and the

estimated covariance matrix consistent. Asymptotic

regression methodology can then be used to describe the

variation among the estimates and address hypotheses of

interest.

Let $y_{il} = (y_{i1l}, y_{i2l}, \ldots, y_{idl})'$ represent a set of responses for d outcome measures for the l-th subject in the i-th subpopulation. The d variables can be either continuous or categorical in nature. In the repeated measurements setting, the d variables might denote d times or conditions. Let the expected value of $y_{il}$ be $\mu_i = (\mu_{i1}, \mu_{i2}, \ldots, \mu_{id})'$. The basic problem is to determine an estimate $\hat{\mu}_i$ for $\mu_i$ and to estimate its covariance matrix in the face of missing values among the components of the $y_{il}$.

Let $u_{il} = (u_{i1l}, u_{i2l}, \ldots, u_{idl})'$ be a random vector of indicator variables, where $u_{ikl}$ has the value 1 if the response $y_{ikl}$ is observed and 0 otherwise. The ratio estimator of $\mu_{ik}$ can be expressed as

$$R_{ik} = \left[ \sum_{l=1}^{n_i} (f_{ikl}/n_i) \right] \left[ \sum_{l=1}^{n_i} (u_{ikl}/n_i) \right]^{-1} = \exp\{\log(\bar{f}_{ik}) - \log(\bar{u}_{ik})\} \tag{3.5.1}$$

where $f_{ikl} = y_{ikl} u_{ikl}$. Thus, $f_{ikl}$ takes the response value if

it is observed, and otherwise is set to 0. This estimate is

equivalent to what would be obtained if the missing data for

the k-th response variable were replaced with the mean of

the complete data. However, since the construction of the

estimator is based on the sample as a whole, one can take

advantage of its structure to estimate a covariance matrix.
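A minimal sketch of the ratio estimator in (3.5.1) follows. The records are hypothetical 0/1 responses with `None` marking a missing value, and the helper names are illustrative only. The sketch also checks the equivalence stated above: the ratio estimate equals the mean of the observed values for that response.

```python
import math

def ratio_estimate(y, k):
    """Ratio estimator R_k for response k within one subpopulation.

    y is a list of subject records; y[l][k] is None when response k is
    missing for subject l.
    """
    n = len(y)
    u = [0 if rec[k] is None else 1 for rec in y]                # indicators u_kl
    f = [0.0 if rec[k] is None else float(rec[k]) for rec in y]  # f_kl = y_kl * u_kl
    f_bar = sum(f) / n
    u_bar = sum(u) / n
    # The exp/log form mirrors (3.5.1); it needs f_bar > 0 and u_bar > 0.
    return math.exp(math.log(f_bar) - math.log(u_bar))

def available_case_mean(y, k):
    """Mean of the observed values of response k."""
    obs = [rec[k] for rec in y if rec[k] is not None]
    return sum(obs) / len(obs)

# Hypothetical responses at d = 3 time points
y = [
    (1, 1, 0),
    (0, None, 1),
    (1, 0, None),
    (None, 1, 0),
    (1, 1, 1),
]
R = [ratio_estimate(y, k) for k in range(3)]
print([round(r, 4) for r in R])
```

The point of writing the estimator through the sample means of f and u, rather than as a subset mean, is that both means are taken over the whole sample, which is what allows a covariance matrix to be derived.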

Let $g_{il} = (f_{i1l}, f_{i2l}, \ldots, f_{idl}, u_{i1l}, u_{i2l}, \ldots, u_{idl})'$ and let $\bar{g}_i$ be the sample mean vector of the $g_{il}$'s. Then the ratio estimator can be written

$$R_i = \exp\{A \log(\bar{g}_i)\} \tag{3.5.2}$$

with $A = [I_d, -I_d]$. The covariance matrix of $\bar{g}_i$ is written

$$V_{\bar{g}_i} = \ldots \tag{3.5.3}$$

Since $R_i$ is a function of $\bar{g}_i$, a consistent estimator for the covariance matrix of $R_i$ can be written as

$$V_{R_i} = D_{R_i} A D_{\bar{g}_i}^{-1} V_{\bar{g}_i} D_{\bar{g}_i}^{-1} A' D_{R_i} \tag{3.5.4}$$

as a consequence of applying the operation

$$V_F = H V_{\bar{g}_i} H' \tag{3.5.5}$$

the appropriate number of times, where H is the matrix of first derivatives of the functions F evaluated at $\bar{g}_i$. This will yield a positive semi-definite covariance matrix.

Stanish, Gillings, and Koch (1978) derive the form of the general term of $V_{R_i}$, and illustrate that the covariance between the k-th and k'-th ratio estimators is

$$\mathrm{Cov}\{R_{ik}, R_{ik'}\} = \left[ \frac{n_{kk'}^2}{n_k n_{k'}} \right] v_{kk'} \tag{3.5.6}$$

where $n_k$ is the number of observations with data for the k-th response variable, $n_{k'}$ is the corresponding value for the k'-th variable, and $n_{kk'}$ is the number of observations which have values for both variables. $v_{kk'}$ denotes the covariance between the k-th and k'-th ratio estimators, based on these $n_{kk'}$ observations. Thus, one can consider the term $\{n_{kk'}^2 / n_k n_{k'}\}$ a missing data correction factor which is applied to the conditional covariance to generate the covariance in (3.5.6).
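The covariance computation in (3.5.4) and the correction factor in (3.5.6) can be sketched numerically. This is an illustration under stated assumptions, not the authors' program: the data are hypothetical, the covariance matrix of the mean vector is assumed to be the sum of cross-products of deviations divided by n squared (the right-hand side of (3.5.3) is not legible in this copy), and the conditional covariance is assumed to be centered at the ratio estimates. NumPy is used for the matrix algebra.

```python
import numpy as np

def ratio_estimates_with_cov(y):
    """Ratio estimates R and delta-method covariance V_R for one subpopulation.

    y is an (n, d) float array; np.nan marks a missing response.
    """
    n, d = y.shape
    u = (~np.isnan(y)).astype(float)           # indicators u_kl
    f = np.where(np.isnan(y), 0.0, y)          # f_kl = y_kl * u_kl
    g = np.hstack([f, u])                      # rows are the vectors g_l
    g_bar = g.mean(axis=0)
    dev = g - g_bar
    V_g = dev.T @ dev / n**2                   # ASSUMED form of (3.5.3)
    A = np.hstack([np.eye(d), -np.eye(d)])
    R = np.exp(A @ np.log(g_bar))              # (3.5.2); needs g_bar > 0
    H = np.diag(R) @ A @ np.diag(1.0 / g_bar)  # first derivatives of exp(A log g)
    V_R = H @ V_g @ H.T                        # (3.5.4)
    return R, V_R

def corrected_covariance(y, R, k, kp):
    """Correction factor times conditional covariance, as in (3.5.6)."""
    both = ~np.isnan(y[:, k]) & ~np.isnan(y[:, kp])
    n_k = np.sum(~np.isnan(y[:, k]))
    n_kp = np.sum(~np.isnan(y[:, kp]))
    n_both = np.sum(both)
    # Assumed definition: covariance of the two means over the n_both
    # jointly observed subjects, centered at the ratio estimates.
    v = np.sum((y[both, k] - R[k]) * (y[both, kp] - R[kp])) / n_both**2
    return (n_both**2 / (n_k * n_kp)) * v

y = np.array([
    [1, 1, 0],
    [0, np.nan, 1],
    [1, 0, np.nan],
    [np.nan, 1, 0],
    [1, 1, 1],
], dtype=float)

R, V_R = ratio_estimates_with_cov(y)
print(np.round(R, 4))
print(np.round(V_R, 5))
```

Under these assumptions the off-diagonal elements of `V_R` agree with `corrected_covariance`, which is one way to see why the factor involves the square of the number of jointly observed subjects.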

Now, let $\bar{g} = (\bar{g}_1, \bar{g}_2, \ldots, \bar{g}_s)'$ be the vector of means for the s subpopulations, and similarly, $R = (R_1, R_2, \ldots, R_s)'$ with expected value $\mu = (\mu_1, \mu_2, \ldots, \mu_s)'$. Analysis of R can then be undertaken via asymptotic regression models of the form

$$E_A\{R\} = X\beta \tag{3.5.7}$$

An assumption of the ratio estimation procedure is that the

missing data occur at random, i.e. whether or not a variable

is observed is not related to the value it would have had if

it had been observed. The missing data indicator vector $u_{il}$

should not be associated with the data vector $y_{il}$. There is

also a limit to the amount of data which can be missing.

Strictly speaking, this should probably be at most ten

percent.

3.5.2 Supplemental Margins

Another procedure which allows one to deal with

missing data in the categorical data setting is referred to

as supplemental margins. This strategy is more appropriate

for frequency data, and is described in Koch, Imrey and

Reinfurt (1972) and Lehnen and Koch (1974). The data in

this procedure are considered to have two components, those

observations which include values for all possible response

variables, and those observations which only have values for

subsets of the response variables. The method is suitable

when the missing data occur at random, due to either non-

response or bad data, or because additional samples have

occurred which did not include measuring all the response

variables. Even then, it must be assumed that whatever

process determined the subset of response variables measured

was independent of the values those observations would have

for all the variables.

The methodology involved is very similar to that in

the weighted least squares analysis of frequency data as

described in section 3.2.2. Let the data vector be written as $y = (y_1', y_2', \ldots, y_s')'$, where $y_i' = (y_{i1}, y_{i2}, \ldots, y_{ir_i})$. Here, i=1,2,...,s indexes the independent subpopulations and j=1,2,...,$r_i$ denotes the response category in the i-th

subpopulation. Note that the number of response categories

and what they represent is allowed to vary from

subpopulation to subpopulation. y can be considered to follow the product multinomial distribution and its likelihood function is written

$$\Pr\{y\} = \prod_{i=1}^{s} \left[ n_{i*}! \prod_{j=1}^{r_i} \frac{\pi_{ij}^{y_{ij}}}{y_{ij}!} \right] \tag{3.5.8}$$

where $n_{i*} = \sum_j y_{ij}$ and $\pi_{ij}$ is the probability that a randomly selected subject from the i-th subpopulation has the j-th response for that subpopulation. Note the difference from the likelihood (3.2.3) due to the use of $r_i$ instead of a single r common to all subpopulations.

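Let $p_i = (y_i / n_{i*})$ represent the proportions vector for the i-th subpopulation, and let $p_G$ be a compound vector defined by $p_G' = (p_1', p_2', \ldots, p_s')$. Then $E\{p_G\} = \pi_G$, where $\pi_G' = (\pi_1', \pi_2', \ldots, \pi_s')$. $\mathrm{Var}\{p_G\}$ will be a block diagonal matrix with the i-th block given by

$$V_i(\pi_i) = \frac{1}{n_{i*}} \left[ D_{\pi_i} - \pi_i \pi_i' \right]$$

where $D_{\pi_i}$ is the diagonal matrix with the elements of $\pi_i$ on the diagonal. A consistent estimator for $\mathrm{Var}\{p_G\}$ will be the block diagonal matrix $V(p_G)$, where $p_i$ is substituted for $\pi_i$. $V(p_G)$ has the same form as V(p) discussed above in a 'complete' multinomial data setting, but has a different dimension since the diagonal blocks will be of size $r_i \times r_i$ instead of a uniform $r \times r$.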

Let $F = [F_1(p_G), F_2(p_G), \ldots, F_u(p_G)]'$ be a set of u functions of $p_G$. If the previously stated condition of requiring the elements of F to have partial derivatives with respect to $p_G$ in a region containing $\pi_G$ is upheld, then the covariance matrix of F is estimated by

$$V_F = [H(p_G)] \, V(p_G) \, [H(p_G)]'$$

where

$$H(p_G) = \left[ \frac{\partial F}{\partial z} \, \Big| \, z = p_G \right]$$

A linear model describing the variation among the function estimates can be expressed as

$$E_A\{F\} = X\beta$$

For many situations, $F(p_G)$ will be linear functions

obtained from $F = A p_G$; these would be estimates of marginal

probabilities or cell probabilities. An example will serve

to illustrate the use of supplemental margins in an

analysis.

Table 3.5.1
Presence or Absence of Cold Symptoms in 1973, 1974, and 1975 for Female Subjects in Birmingham

Years          Pattern of Response
Included       1973   1974   1975     Frequency

73,74,75        Y      Y      Y          175
                Y      Y      N           80
                Y      N      Y           46
                Y      N      N           46
                N      Y      Y           64
                N      Y      N           64
                N      N      Y           46
                N      N      N           70

73,74           Y      Y                  51
                Y      N                  19
                N      Y                  25
                N      N                  26

73,75           Y             Y           17
                N             Y            6
                Y             N            9
                N             N            8

74,75                  Y      Y          233
                       Y      N          127
                       N      Y           87
                       N      N          112

73              Y                        433
                N                        271

74                     Y                 363
                       N                 235

75                            Y          444
                              N          317

Table 3.5.1 contains the frequencies of responses

corresponding to the possible response profiles for whether

or not a cold was reported during each of the years 1973,

1974, and 1975 for females in Birmingham. The data are

considered to come from seven subpopulations, corresponding

to the possible combinations of years for which the subjects

had data present. The number of possible response profiles

$r_i$ ranges from 2 for the single year subpopulations to 8 for the complete data subpopulation.

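Let the data be represented by the array

$$
\begin{aligned}
y_{345}' &= (175, \; 80, \; 46, \; 46, \; 64, \; 64, \; 46, \; 70) \\
y_{34}' &= (51, \; 19, \; 25, \; 26) \\
y_{35}' &= (17, \; 6, \; 9, \; 8) \\
y_{45}' &= (233, \; 127, \; 87, \; 112) \\
y_{3}' &= (433, \; 271) \\
y_{4}' &= (363, \; 235) \\
y_{5}' &= (444, \; 317)
\end{aligned}
$$

Let the proportion vector for the i-th subpopulation be represented by $p_i = (y_i / n_{i*})$, where i takes the values {345, 34, 35, 45, 3, 4, 5} for the combinations of the years 1973, 1974, and 1975. The linear functions which are the marginal probabilities of having colds for the years 1973, 1974, and 1975 can be generated by

$$F = A p_G$$

where A is the following 12 x 26 transformation matrix, which is block diagonal with all elements outside the blocks equal to zero:

$$
A = \mathrm{diag}(A_{345}, A_{34}, A_{35}, A_{45}, A_{3}, A_{4}, A_{5}), \qquad
A_{345} = \begin{bmatrix} 1&1&1&1&0&0&0&0 \\ 1&1&0&0&1&1&0&0 \\ 1&0&1&0&1&0&1&0 \end{bmatrix},
$$

$$
A_{34} = A_{35} = A_{45} = \begin{bmatrix} 1&1&0&0 \\ 1&0&1&0 \end{bmatrix}, \qquad
A_{3} = A_{4} = A_{5} = \begin{bmatrix} 1&0 \end{bmatrix}
$$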

It follows that the functions can be described with the following model:

$$E_A\{F\} = X\pi$$

where $\pi = (\pi_{73}, \pi_{74}, \pi_{75})'$ is the parameter vector and $\pi_k$ denotes the probability of having a cold in the k-th year (k = 1973, 1974, 1975):

$$
E_A\{F\} = X\pi =
\begin{bmatrix}
1&0&0 \\ 0&1&0 \\ 0&0&1 \\ 1&0&0 \\ 0&1&0 \\ 1&0&0 \\ 0&0&1 \\ 0&1&0 \\ 0&0&1 \\ 1&0&0 \\ 0&1&0 \\ 0&0&1
\end{bmatrix}
\begin{bmatrix} \pi_{73} \\ \pi_{74} \\ \pi_{75} \end{bmatrix}
$$

The estimate for the covariance matrix of F is given by $V_F = A[V(p_G)]A'$. The goodness of fit statistic for this model is $Q_W$ = 6.25, with d.f.=9, thus indicating an adequate fit. This is support for the assumption that the same parameters $\pi_{73}, \pi_{74}, \pi_{75}$ represent the probability of reporting a cold in 1973, 1974, and 1975 respectively for each of the subpopulations in which they apply. The hypothesis that these probabilities are the same for each of the years was investigated with a test of the form $C\pi = 0$, where

$$C = \begin{bmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \end{bmatrix}$$

The resulting test statistic $Q_C$=17.24 (d.f.=2) is significant at α=.05, and thus this hypothesis is rejected. Table 3.5.2 contains the estimated probabilities $\pi_{73}$, $\pi_{74}$ and $\pi_{75}$, as well as their standard errors. For comparative purposes, these estimates are also presented for analogous analyses when only the subjects from the complete data subpopulation were included, and also for the analysis of the data coming from the 1 and 2 year subpopulations only.

Table 3.5.2
Estimated Probabilities and Standard Errors

                               Subjects With      Subjects from One
Parameter     All Subjects     Complete Data      or Two Years

π73           .599 (.013)      .587 (.020)        .523 (.023)
π74           .635 (.011)      .648 (.020)        .515 (.024)
π75           .574 (.011)      .560 (.020)        .469 (.019)
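The supplemental margins analysis of this example can be sketched in a few lines of NumPy. This is an illustration, not the program used for the dissertation. The frequencies are transcribed from Table 3.5.1, and the profile order within each vector is assumed to match the rows of the A blocks given earlier. Because of that assumption and possible transcription differences, the sketch is not guaranteed to reproduce the published statistics exactly. All variable and function names are illustrative.

```python
import numpy as np

# Frequencies transcribed from Table 3.5.1, one vector per subpopulation.
COUNTS = {
    "345": [175, 80, 46, 46, 64, 64, 46, 70],
    "34": [51, 19, 25, 26],
    "35": [17, 6, 9, 8],
    "45": [233, 127, 87, 112],
    "3": [433, 271],
    "4": [363, 235],
    "5": [444, 317],
}
# Parameter index (0 = 1973, 1 = 1974, 2 = 1975) estimated by each function.
YEARS = {"345": (0, 1, 2), "34": (0, 1), "35": (0, 2), "45": (1, 2),
         "3": (0,), "4": (1,), "5": (2,)}
# A blocks keyed by the number of response profiles r_i (assumed profile order).
A_BLOCK = {
    8: np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                 [1, 1, 0, 0, 1, 1, 0, 0],
                 [1, 0, 1, 0, 1, 0, 1, 0]], dtype=float),
    4: np.array([[1, 1, 0, 0],
                 [1, 0, 1, 0]], dtype=float),
    2: np.array([[1, 0]], dtype=float),
}

def fit_supplemental_margins():
    F_parts, V_blocks, X_rows = [], [], []
    for key, y in COUNTS.items():
        y = np.asarray(y, dtype=float)
        n = y.sum()
        p = y / n
        V_p = (np.diag(p) - np.outer(p, p)) / n   # multinomial covariance block
        A_i = A_BLOCK[len(y)]
        F_parts.append(A_i @ p)                    # marginal proportions
        V_blocks.append(A_i @ V_p @ A_i.T)
        for year in YEARS[key]:
            X_rows.append(np.eye(3)[year])
    F = np.concatenate(F_parts)
    X = np.array(X_rows)
    V_F = np.zeros((F.size, F.size))
    pos = 0
    for block in V_blocks:                         # assemble block diagonal V_F
        m = block.shape[0]
        V_F[pos:pos + m, pos:pos + m] = block
        pos += m
    V_inv = np.linalg.inv(V_F)
    V_b = np.linalg.inv(X.T @ V_inv @ X)           # covariance of the estimates
    b = V_b @ X.T @ V_inv @ F                      # weighted least squares estimates
    resid = F - X @ b
    Q_W = float(resid @ V_inv @ resid)             # goodness of fit, d.f. = 12 - 3
    C = np.array([[1, -1, 0], [1, 0, -1]], dtype=float)
    Cb = C @ b
    Q_C = float(Cb @ np.linalg.inv(C @ V_b @ C.T) @ Cb)  # Wald statistic, d.f. = 2
    return F, X, b, np.sqrt(np.diag(V_b)), Q_W, Q_C

F, X, b, se, Q_W, Q_C = fit_supplemental_margins()
print("estimates:", np.round(b, 3), "standard errors:", np.round(se, 3))
print("Q_W:", round(Q_W, 2), "Q_C:", round(Q_C, 2))
```

The first three elements of `F` are the marginal proportions from the complete data subpopulation alone, so they can be compared directly with the "Subjects With Complete Data" column of Table 3.5.2.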

3.6 Summary

Weighted least squares methodology is appropriate for

a variety of analysis situations for categorical data. The

function vector of interest can take a broad range of forms,

depending on the hypotheses to be addressed, the data

structures, and the strength of the sample sizes. Linear

models fit to the function vector can provide a description

of the variation among the function estimates as well as

providing a framework in which to test the hypotheses of

interest. Section 3.4 discussed the specific data situation

of repeated measurements, and the use of linear models for

categorical data to analyse such datasets. The chapter

ended with a discussion of two strategies for dealing with

incomplete data in the categorical data setting.

The CHESS dataset provides an opportunity to

illustrate the application of weighted least squares

methodology as described in this chapter to a large, complex

dataset. Chapter V will be concerned with the analysis of

the complete data. Linear model descriptions of several

functions of interest will be pursued for a

crossclassification of estimates identified by a variable

selection scheme described in Chapter IV. Chapter VI will

address the analysis of the incomplete data via the

application of multivariate ratio estimation and

supplemental margins techniques.

CHAPTER IV

ANALYSIS OF COMPLETE DATA: VARIABLE SELECTION

4.1 Introduction

The approach to the analysis of the complete data,

consisting of those 4002 individuals with responses for each

of the nine time points, will consist of two stages as

discussed in the beginning of Chapter II. The first

involves the use of randomization test statistics to

determine the extent of association in the data between the

response variables and evaluation variables, including both

the variable indicating the pollution level and also those

of a demographic nature. Once the major sources of

variation for the response variables are determined,

estimates corresponding to the cross-classification of those

variables can be produced and the variation among those

estimates modeled in the context of weighted least squares.

One can thus generate a linear model description of the

variation, including the estimation of model parameters.

Clearly, variable selection procedures can be useful in the

first phase of such an analysis.

Randomization test statistics including extensions of

the Mantel-Haenszel methodology can be useful tools in

choosing among a set of explanatory variables those which

maximally explain the variation in a set of response

variables. Sections 4.2 and 4.3 describe an existing

strategy for variable selection in the case of categorical

data analysis and also an extension to the situation where

the response is multivariate. Section 4.4 will be concerned

with the application of this strategy to the complete data

to determine the explanatory variables for the subsequent

linear models analyses which will be discussed in Chapter V.

4.2 Variable Selection for Categorical Data

The statistician is often faced with a large number of

explanatory variables in a dataset with which to explain the

variation in the outcomes for the response measure or

measures. Especially when one is dealing with a very large

dataset and/or a great many variables, it may be reasonable

to limit the modeling phase of the analysis to a subset of

the variables, for both monetary and computational

considerations. This can be very important in categorical

data analysis, since weighted least squares analysis

generally involves modeling the estimates produced by a

cross-classification of the explanatory variables. If these

become too great in number, many of the cross-classification

cells will have inadequate sample sizes for the analysis to

proceed. Other modeling procedures commonly used for

categorical data rely on iterative algorithms to produce

maximum likelihood estimates. These can become

prohibitively expensive if a large number of variables are

involved.

Higgins and Koch (1977) describe a procedure for

variable selection with categorical data which has also been

described in Clarke and Koch (1976). The procedure is

somewhat similar to forward stepwise regression as used with

continuous data. Higgins and Koch applied their method to

an environmental dataset in which the response measure of

interest was dichotomous--the presence or absence of

byssinosis in cotton textile workers. The explanatory

variables included measures pertaining to work conditions

and demographics. The purpose of their procedure is to find

a subset of variables responsible for the most variation in

the response outcomes among the subpopulations defined by

the cross-classification of such variables.

First, Pearson chi-square statistics are computed for

the first order association between the response variable

and each of the explanatory variables. The statistic, Qp,

is divided by its degrees of freedom, and the first variable

selected is that with the largest chi-square per degree of

freedom. The process continues with the calculation of Qp

for the cross-classification of the response variable with

all two-way combinations of the first variable selected and

all the remaining variables. Qp/d.f. is determined for each

of these contingency tables, and again, the variable with

the highest chi-square per degree of freedom is eligible for

selection.
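The first stage of the selection procedure described above can be sketched as follows. The two contingency tables are hypothetical, and the names `pearson_chi_square` and `select_first` are illustrative; the sketch only shows the ranking by chi-square per degree of freedom, not the later stages or the termination rules.

```python
def pearson_chi_square(table):
    """Pearson chi-square and its d.f. for an s x r table of counts.

    Rows are levels of an explanatory variable, columns are levels of
    the response variable.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    q = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            q += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return q, df

def select_first(tables):
    """Return the variable with the largest chi-square per degree of freedom."""
    per_df = {}
    for name, table in tables.items():
        q, df = pearson_chi_square(table)
        per_df[name] = q / df
    return max(per_df, key=per_df.get), per_df

# Hypothetical explanatory variables cross-classified with a dichotomous response
tables = {
    "X1": [[30, 70], [50, 50]],            # 2 levels
    "X2": [[20, 30], [25, 25], [35, 15]],  # 3 levels
}
first, per_df = select_first(tables)
print(first, {k: round(v, 3) for k, v in per_df.items()})
```

In this toy input X2 has the larger chi-square but X1 has the larger chi-square per degree of freedom, so X1 would be selected first. That is the distinction the criterion is designed to make.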

The authors discuss two possible termination

statistics. Termination statistic "a" is the sum of the

Pearson chi-square statistics for the association of the

response variable and eligible explanatory variables at each

realization of the possible combinations of the levels of

the variables already selected. Termination statistic "b"

is the randomization statistic QAR, where the explanatory

variable is the eligible variable, and the q strata refer to

the possible combinations of the variables already selected.

The advantage of the termination statistic "a" is said to be

that it assesses total association; its weaknesses are that

as the selection process goes on, its degrees of freedom

become so great that it becomes extremely stringent and the

data become so sparse that its validity as a chi-square

statistic becomes suspect. Termination statistic "b", as

the extended Mantel-Haenszel statistic, detects average

partial association and will have less stringent sample size

requirements.

Whatever termination statistic is employed, the

criterion for failure to include is a previously-decided

significance level, usually α=.05 or α=.10. At that point,

one is considered to have a reasonable set of variables for

further analysis since none of the remaining variables have

a significant effect once the selected variables have been

taken into account. While other reasonable subsets of

variables may also exist, the motivation for the Higgins-Koch

algorithm is to find one which maximizes the variation

among the response variable outcome levels; thus it can be

considered of particular interest. It should also be noted

that, since the variable selected at each stage is the one

which in combination with other selected variables maximizes

the total variation among the corresponding outcome variable

levels relative to their degrees of freedom, the selection

criterion can be considered analogous to maximizing a mean

square due to regression in regression analysis.

4.3 Variable Selection Extended to Multivariate Response Profiles

One implied and implemented extension of the Higgins-Koch

strategy is to datasets in which the response variable

is not dichotomous, but has more than two outcome levels.

Another extension which follows logically is to the

situation where one is interested in more than one dependent

variable at one time. An example of such situations is

repeated measurement designs in which one is also involved

in assessing variation of a response over time. In the

CHESS datasets, there is interest in determining a subset of

explanatory variables with which to form a cross-classification

of estimates for the multivariate profile

corresponding to whether or not individuals had 2 or more

colds (2 + colds) in 1973, 1974, or 1975.

Questions of this kind can be addressed with the

multivariate randomization statistics discussed in Chapter

II. Variable selection can proceed along the same lines as

in section 4.2, with the screening process for explanatory

variables intended to find those with the maximum variation

among a set of multivariate profiles, rather than among the

levels of a single outcome variable. The strategy employed

in this chapter is to first calculate QAR for association

of the explanatory variables with the vector of summary

measures calculated from the response profiles. Then, QAR

is divided by its degrees of freedom, and the variable which

is significant at α=.05 and also has the largest chi-square

per degree of freedom is selected.

The procedure continues with construction of strata

for the levels of the variables already selected. The

statistic QAR for the average partial association of the

remaining explanatory variables on the dependent variable

summary measures across the strata is calculated for each

one. The variable with the largest chi-square per degree of

freedom which is also significant at the chosen α level is

selected as the next variable. Note that this process is

effectively the same as calculating all possible termination

statistics "b" in the Higgins and Koch paper (1977) and

bypassing the phase where one calculates a chi-square

statistic for an overall table where the subpopulation

levels are combinations of selected variables.
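The stepwise procedure just described can be sketched as a loop over a user-supplied statistic. The sketch assumes a hypothetical function `statistic(var, strata)` that returns the triple (QAR, d.f., p-value) for a candidate variable given the variables already selected; here it is replaced by a lookup table of made-up values so that the loop can be run. Sample size limits on further stratification, which the text treats as a separate consideration, are not modeled.

```python
def forward_select(candidates, statistic, alpha=0.05):
    """Select variables by largest significant QAR per degree of freedom.

    statistic(var, strata) must return (q, df, p) for the average partial
    association of var with the response summary measures, controlling
    for the variables listed in strata.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        eligible = []
        for var in remaining:
            q, df, p = statistic(var, tuple(selected))
            if p < alpha:
                eligible.append((q / df, var))
        if not eligible:
            break                      # no remaining variable is significant
        best = max(eligible)[1]        # largest chi-square per d.f.
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical (Q, d.f., p) values for three candidate variables A, B, C
LOOKUP = {
    (): {"A": (30.0, 2, 0.0001), "B": (12.0, 1, 0.0005), "C": (3.0, 2, 0.22)},
    ("A",): {"B": (9.0, 1, 0.003), "C": (2.5, 2, 0.29)},
    ("A", "B"): {"C": (2.0, 2, 0.37)},
}

def lookup_statistic(var, strata):
    return LOOKUP[strata][var]

selected = forward_select(["A", "B", "C"], lookup_statistic)
print(selected)
```

With these made-up values A enters first, B enters after controlling for A, and the process stops because C is not significant within the A x B strata.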

While the use of an average partial association

statistic may have the usual limitations, its use over the

total association statistic QTR seems defensible on the

following grounds. QTR would lose its effectiveness early

since its degrees of freedom would be q*(s-1)*u and would

become so great so quickly that QTR would be excessively

stringent as a screening device. Also, one would run into sample size problems, since QTR is linked to individual stratum sizes instead of total strata sizes as is QAR. Since the use of multivariate profiles tends to thin

data out quickly, sample size considerations are critical

for multivariate analyses.

The process continues with the construction of QAR for

the average partial association of the unselected variables

over strata formed from the combination of the levels of the

selected variables until one has a subset of variables which

maximize the variation among the multivariate profile

summary measures simultaneously, and can proceed with the

linear modeling of such variation.

4.4 Application of Multivariate Variable Selection to CHESS Data for 2+ Colds and 1+ Asthma in 1973, 1974, and 1975

As described in Chapter I, the health status

information collected was whether or not each child was

experiencing cold or asthma symptoms at each of the nine

time points of the CHESS study. Since the data were

separated into three yearly components, 1973, 1974, and

1975, summary measures were created for preliminary

descriptive purposes which included the total number of

colds or asthma or both, for each year. So, there are a

number of potential response measures to analyse, and the

choice necessarily depends on the purpose of the analysis as

well as the subjects it covers.

Table 4.4.1 displays the distribution of responses by

each year and overall for both colds and asthma.

Table 4.4.1

Frequency of Colds and Asthma in 1973, 1974, and 1975

                        Colds                            Asthma
Frequency    Overall   1973   1974   1975     Overall   1973   1974   1975

    0           688    1936   1788   1858       3634    3801   3802   3747
    1           875    1247   1271   1242        142     102    101     84
    2           752     633    687    639         54      40     34     58
    3           603     186    256    263         37      59     65    113
    4           443                               26
    5           272                               23
    6           192                               15
    7           112                               16
    8            48                               15
    9            17                               40

Four summary measures were considered reasonable to

construct: one indicating whether subjects had 2 or more

colds in a given year, one indicating whether subjects had

one or more asthma events in a given year, one indicating

whether subjects had 0-1,2-3, or 4+ total colds and one

indicating whether subjects had 0 or 1 total asthma events.

Also constructed were the mean number of colds, and mean

number of asthma events for each year. Thus, there are six

reasonable response variables to analyse, three each for

asthma and colds. Those relating to total colds or total

asthma events over the three years obviously rule out the

possibility of assessing time as a source of variation, but

their consideration does provide a point of comparison for

the advantages of the multivariate analyses which do.

The primary focus of the analysis efforts for the

complete data is to assess the variation among those who had

2+ colds (and separately, 1+ asthma) in the three years

represented by the complete data. The variables available

in this dataset are as follows: SEX, AREA, RACE2 (white

versus other), RACE3 (white versus black versus other), AGE2

(younger versus older), AGE4 (four age groupings), and SPOLL

(low, medium, or high) pollution levels. There is also

interest in whether there was any variation over time. The

first stage involves using the variable selection scheme

presented in preceding sections of this chapter to determine

the appropriate set of explanatory variables. The second

stage involves the description of the variation with a

linear model. One of the objectives of this dissertation is

to describe a systematic process by which to accomplish both

of these stages, while being as efficient as possible with

respect to the size of the dataset.

The question of whether there is variation in the

distribution of 2+ colds (1+ asthma) across the three years

can first be addressed within the context of multivariate

randomization tests by creating summary measures which

reflect the differences in the proportions of the outcome

between the years. If there is a difference, then one has to

incorporate time into subsequent analyses, either through

the use of multivariate response measures in the

randomization tests used as part of variable selection

strategies or time parameters in subsequent weighted least

squares modeling. Table 4.4.2 presents the results of multivariate randomization tests for the association between the evaluation variables and two measures created from the multivariate profiles: the first is the difference between the proportion of those having 2+ colds (1+ asthma) in 1973 and 1974, and the second is the difference for 1974 and 1975.

Table 4.4.2

Multivariate Randomization Tests for the Association of the Explanatory Variables with the Difference in 2+ Colds and 1+ Asthma Between 1973-1974, and 1974-1975

2+ Colds
Variable      QAR      d.f.    P-Value

AREA         53.81      10     0.0000
SEX           2.31       2     0.3146
RACE2         3.53       2     0.1716
RACE3        12.15       4     0.0162
AGE2          5.36       2     0.0685
AGE4         14.53       6     0.0242
SPOLL         7.62       4     0.1066

1+ Asthma
Variable      QAR      d.f.    P-Value

AREA          8.81      10      .5499
SEX           7.61       2      .0222
RACE2         1.72       2      .4242
RACE3         2.40       4      .6626
AGE2          1.10       2      .5763
AGE4          1.24       6      .9750
SPOLL         3.25       4      .5166

These summary measures were constructed via an A matrix of the form

$$A = \begin{bmatrix} 0 & -1 & 0 & -1 & 1 & 0 & 1 & 0 \\ 0 & 0 & -1 & -1 & 1 & 1 & 0 & 0 \end{bmatrix}$$

QAR is the multivariate mean score statistic of interest, with d(s-1) degrees of freedom, where d=2 for the two response measures created. $C = [I_{s-1}, -1]$, where s is the number of levels of the evaluation variable.

Four of the seven tests indicate that a time x

evaluation variable interaction exists at the α=.05 level of

significance for 2+ colds, while only sex has a significant

time interaction for the 1+ asthma outcome measures

(p=.0222). However, one can conclude from these tests that

both 2+ colds and 1+ asthma interact with time for one or

more of the explanatory variables, and hence, that time

should be incorporated into the future analysis structure.

The variable selection thus proceeded by focusing on the

three proportions of 2+ colds (and 1+ asthma) for 1973,

1974, and 1975 simultaneously.

The same analyses were repeated, using

$$A = \begin{bmatrix} 0&1&0&1&0&1&0&1 \\ 0&0&1&1&0&0&1&1 \\ 0&0&0&0&1&1&1&1 \end{bmatrix}$$

to create summary measures which are the proportions of 2+

colds in 1973, 1974, and 1975. Table 4.4.3 shows the

relative ranking of the explanatory variables for QAR

divided by its degrees of freedom. The table also includes

the same information for the dependent variable 1+ asthma.
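How the 3 x 8 matrix A given above turns a vector of response profile proportions into three yearly proportions can be shown with a small sketch. The profile proportions are hypothetical. The ordering of the eight profiles is inferred from the pattern of ones in A (the indicator for the first row changes fastest), and which row corresponds to which year is not stated in this passage, so the rows are simply labeled by position.

```python
# Rows of A as printed in the text; columns index the 8 response profiles.
A = [
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]

def apply_matrix(A, p):
    """Return the vector A p for a list-of-rows matrix A and a list p."""
    return [sum(a * x for a, x in zip(row, p)) for row in A]

def profile(j):
    """Indicator pattern of profile j implied by the columns of A."""
    return tuple(row[j] for row in A)

# Hypothetical proportions of subjects in each of the 8 profiles (sum to 1)
p = [0.30, 0.10, 0.10, 0.05, 0.15, 0.10, 0.05, 0.15]

marginal = apply_matrix(A, p)
print([profile(j) for j in range(8)])
print([round(m, 3) for m in marginal])
```

Each row of A sums the proportions of the profiles in which the corresponding indicator equals 1, which is why the result is a vector of marginal proportions rather than profile proportions.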

SEX emerges as the first variable to include for both 2+ colds and 1+ asthma, as its chi-square per degree of freedom is over five times as great as that of the next contending variable for colds and over twice as much for asthma. QAR is significant for both colds and asthma at the α=.05 level of significance. It should be noted that all of the tests are significant (α=.05) for colds except for the pollution index, whereas SEX is the only significant variable according to the same criterion for 1+ asthma.

Table 4.4.3

Relative Ranking of Explanatory Variables with Respect to Their Association with 2+ Colds and 1+ Asthma in 1973, 1974, and 1975 as Determined by QAR/d.f.

2+ Colds
Variable      QAR      d.f.    P-Value    QAR/d.f.

SEX         126.06       3      0.000      42.02
RACE2        23.35       3      0.000       7.78
AREA        104.86      15      0.000       6.99
RACE3        33.22       6      0.000       5.54
AGE2          9.25       3      0.026       3.08
AGE4         21.17       9      0.012       2.35
SPOLL         8.62       6      0.196       1.44

1+ Asthma
Variable      QAR      d.f.    P-Value    QAR/d.f.

SEX          14.07       3      0.0028      4.69
AREA         22.16      15       .1037      1.48
SPOLL         8.00       6       .2381      1.33
RACE2         2.82       3       .4198       .94
AGE2          1.92       3       .5894       .64
RACE3         3.61       6       .7288       .60
AGE4          2.85       9       .9699       .32
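The ranking criterion used in Table 4.4.3 is simple arithmetic and can be checked directly. The sketch below copies the QAR values, degrees of freedom, and p-values from the table and ranks the variables by QAR/d.f., then picks the top-ranked variable that is also significant at the chosen level. The function name `rank_variables` is illustrative.

```python
# (variable, QAR, d.f., p-value) copied from Table 4.4.3
COLDS = [
    ("SEX", 126.06, 3, 0.000), ("RACE2", 23.35, 3, 0.000),
    ("AREA", 104.86, 15, 0.000), ("RACE3", 33.22, 6, 0.000),
    ("AGE2", 9.25, 3, 0.026), ("AGE4", 21.17, 9, 0.012),
    ("SPOLL", 8.62, 6, 0.196),
]
ASTHMA = [
    ("SEX", 14.07, 3, 0.0028), ("AREA", 22.16, 15, 0.1037),
    ("SPOLL", 8.00, 6, 0.2381), ("RACE2", 2.82, 3, 0.4198),
    ("AGE2", 1.92, 3, 0.5894), ("RACE3", 3.61, 6, 0.7288),
    ("AGE4", 2.85, 9, 0.9699),
]

def rank_variables(rows, alpha=0.05):
    """Rank by QAR/d.f. and return (ranking, first significant variable)."""
    ranked = sorted(rows, key=lambda r: r[1] / r[2], reverse=True)
    significant = [r[0] for r in ranked if r[3] < alpha]
    choice = significant[0] if significant else None
    return [(r[0], round(r[1] / r[2], 2)) for r in ranked], choice

colds_ranking, colds_choice = rank_variables(COLDS)
asthma_ranking, asthma_choice = rank_variables(ASTHMA)
print(colds_ranking, colds_choice)
print(asthma_ranking, asthma_choice)
```

Note that AREA has the second largest QAR for colds but ranks third once its 15 degrees of freedom are taken into account.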

The next step of the analysis was to assess the

strength of the association between the explanatory

variables and the response measures while controlling for

SEX. Strata were formed corresponding to males and females

and the average partial association statistics QAR computed.

Table 4.4.4 displays the relative ranking of the explanatory

variables in this second phase of screening according to

the same criterion QAR/d.f., where the degrees of freedom is

still going to be (s-1)*d. The table includes both 2+ colds

and 1+ asthma, since both dependent variables had SEX enter

as the first variable selected in the screening process.

Table 4.4.4

Relative Ranking of Explanatory Variables with Respect to Their Association with 2+ Colds and 1+ Asthma in 1973, 1974, and 1975 as Determined by QAR/d.f. Controlling for Sex

2+ Colds
Variable      QAR      d.f.    P-Value    QAR/d.f.

RACE2        23.23       3      0.000       7.74
AREA        100.08      15      0.000       6.67
RACE3        33.36       6      0.000       5.56
AGE2          7.75       3      0.052       2.58
AGE4         18.46       9      0.030       2.05
SPOLL         8.30       6      0.217       1.38

1+ Asthma
Variable      QAR      d.f.    P-Value    QAR/d.f.

AREA         21.23      15       .1295      1.42
SPOLL         7.72       6       .2592      1.29
RACE2         2.73       3       .4355       .91
AGE2          1.86       3       .6011       .62
RACE3         3.54       6       .7384       .59
AGE4          2.81       9       .9714       .31

RACE2 becomes the second variable selected for the

yearly outcomes concerning 2+ colds with a QAR/d.f. of

7.743, compared to the next highest criterion of 6.672 for

AREA. The three level classifier for race, RACE3, had the


third highest criterion of 5.560. All of these variables

also had significant QAR at α=.05. The pollution index

variable again showed no association with a QAR of 8.30 and

6 d.f.(p=.2173). None of the explanatory variables for 1+

asthma are significant, although AREA has the highest

chi-square per degree of freedom criterion of 1.42. Its p-value

for QAR is .1295, which may be considered borderline

significant at an α=.10 level for an analysis which has

screening as its primary objective rather than strict

hypothesis-testing.

Consequently, strata corresponding to a SEX x RACE2

crosstabulation were constructed, and QAR calculated for the

partial association of the remaining explanatory variables

with the proportion of 2+ colds in 1973, 1974, and 1975,

adjusting for SEX and RACE2. The results of these

computations are shown in Table 4.4.5. AREA emerges as the

next variable to enter, with a chi-square per degree of

freedom of 6.65. The next largest value was 2.51 for AGE2.

No further work was done for 1+ asthma.

At this point, three variables have been selected by

the variable selection process for their association with 2+

colds (SEX, RACE2, and AREA), and two variables have been

selected for 1+ asthma (SEX and AREA). The selection process

could be carried forward one more time for 2+ colds to

assess whether the remaining explanatory variables were

associated with 2+ colds after controlling for the first

three variables selected. However, it is doubtful whether


these data could support stratification into more than 24

subpopulations for a linear models analysis. As a general

rule, the selection procedure for categorical data will

usually involve stopping after three variables have been

selected for this reason as well as for computational

considerations involving the subsequent analyses. Thus,

although the next step was completed and did not result in

any further significant associations, it could well have been

omitted.

Note that the pollution index variable SPOLL did not

survive the variable selection process. No first order

association between SPOLL and either 1+ asthma or 2+ colds

was found, and no association was found when other

explanatory variables were adjusted for in subsequent

analyses. Since the index is ordinally scaled, a more

appropriate test statistic would be the correlation

statistic discussed in Chapter II for assessing the

association with SPOLL as well as the other ordinally scaled

variables AGE2 and AGE4. This analysis was performed,

but no different results were obtained. This is not really

surprising, given that the quality of the pollution

data on which the index was based was debatable, as

discussed in Chapter I.

Table 4.4.5

Relative Ranking of Explanatory Variables with Respect to Their Association with 2+ Colds in 1973, 1974, and 1975 as Determined by QAR/d.f. Controlling for Sex and Race2

2+ Colds

Variable     QAR     d.f.   P-Value   QAR/d.f.
AREA        97.44     15    0.000       6.50
AGE2         7.52      3    0.057       2.51
AGE4        18.38      9    0.031       2.04
SPOLL       11.27      6    0.0804      1.88

CHAPTER V

LINEAR MODELS ANALYSIS OF COMPLETE DATA

This chapter is concerned with the application of

weighted least squares methodology to the analysis of the

health status measures in the CHESS study for the children

with complete data. Variables selected as a consequence of

the randomization analyses of Chapter IV are the basis of

the subpopulations under study. Functions examined are the

proportion of children reporting one or more incidents of

asthma in a given year, the proportion of children reporting

two or more colds in a given year, the proportion of

children ever reporting one or more incidents of asthma per

year, and mean colds in a given year. Sections 5.1 and 5.2

discuss different alternatives for data adjustments to be

made when existing zero frequencies would induce covariance

matrix singularities. Additionally, Section 5.2 includes a

discussion of the use of residual analysis in model

selection for categorical data analysis.

5.1 Linear Model Analysis of 1+ Asthma Data

The variable selection in the previous chapter for the

proportion of 1+ asthma in 1973, 1974, and 1975 identified

the variables SEX and AREA as accounting for most of the

variation among the estimates of interest. Table 5.1.1

contains the cross-classification of those subjects having

Table 5.1.1

1+ Asthma Reported in 1973, 1974, and 1975

                      Response profile (years reported)                      Marginals
Area            None   73   74  73,74   75  73,75  74,75  73-75   Total    73   74   75

Male
Charlotte        185    3    0     1     3     1      5      1     199      6    7   10
Birmingham       461   12    3     1    17     1      6     15     516     29   25   39
NYC               76    1    1     0     0     0      0      0      78      1    1    0
Utah             319    4    5     2    11     0      3      5     349     11   15   19
California I     453    7    4     2    16     5      9     26     522     40   41   56
California II    249    3    7     1     9     1      4     14     288     19   26   28

Female
Charlotte        229    4    3     0     3     1      0      8     248     13   11   12
Birmingham       543   11    5     1    14     2      4      9     589     23   19   29
NYC               73    1    0     0     2     0      0      2      78      3    2    4
Utah             353    6    2     1     7     1      2     14     386     22   19   24
California I     418    7    6     3     6     4      2     11     457     25   22   23
California II    275    3    2     1     2     0      4      5     292      9   12   11

1+ asthma by sex, area and year. Each of the columns

represents the number of individuals who reported asthma in

that period of time. 'None' means that no occurrences were

reported at any of the nine time points, '1974' entries are

those who reported an incident in 1974 only, '74,75'

entries represent subjects who reported incidents in both

1974 and 1975, while the column for '73-75' includes the

numbers of subjects who reported asthma in each of the three

years. The table also includes marginal frequencies for

each of the three years. As might be expected, the 'None'

column dominates the table, containing approximately ninety

percent of the data.
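The relationship between the eight response-profile columns and the three yearly marginals can be checked mechanically. The sketch below assumes the column order None; 73; 74; 73,74; 75; 73,75; 74,75; 73-75 described above and uses the Birmingham male row of Table 5.1.1; the names `PROFILES` and `marginals` are illustrative.

```python
# Column order of the eight response profiles in Table 5.1.1; each tuple
# lists the years in which asthma was reported.
PROFILES = [(), (73,), (74,), (73, 74), (75,), (73, 75), (74, 75), (73, 74, 75)]


def marginals(counts):
    """Number of subjects reporting asthma in each year, from profile counts."""
    return {
        year: sum(c for c, prof in zip(counts, PROFILES) if year in prof)
        for year in (73, 74, 75)
    }


birmingham_males = [461, 12, 3, 1, 17, 1, 6, 15]
m = marginals(birmingham_males)
total = sum(birmingham_males)
none_share = birmingham_males[0] / total  # fraction with no reported asthma
print(total, m, round(none_share, 3))
```

The same summation applied to any other row should reproduce that row's Total and Marginals entries.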

The subjects in the sex x area strata are considered

to be conceptually representative of some larger

subpopulation in a sense equivalent to simple random

sampling. Their response profiles are assumed to be

independent, so that the data of Table 5.1.1 have the

product multinomial distribution, i.e.

\[ \Pr\{y\} = \prod_{h=1}^{2} \prod_{i=1}^{6} n_{hi.}! \prod_{j=1}^{8} \frac{\pi_{hij}^{y_{hij}}}{y_{hij}!} \]

where π_hij denotes the probability that a randomly selected

subject from the h-th sex and i-th area has the j-th

profile; y_hij denotes the frequency of the h-th sex, i-th

area and j-th response profile in the sample of size n_hi. .

Also, h=1,2 for male and female respectively, i=1,2,...,6 for

the six areas, and j=1,2,...,8 for the eight response profiles

depicted in Table 5.1.1. The functions of interest are the


first order marginal probabilities of reporting asthma in

1973, 1974, or 1975 for each combination of the levels of

SEX and AREA. Estimators for these linear functions can be

expressed in matrix notation as

\[ F = F(p) = Ap = \left( \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix} \otimes I_{12} \right) p \]

where p=(y_hij/n_hi.) is a (96 x 1) vector of sample

proportions. Note that the data are relatively sparse due

to the domination of the 'None' response profile; this is

particularly true for the two NYC subpopulations. Since

forming A*p results in zero marginal probabilities for NYC

males, .5 was added to the 1975 cell so that the

computational strategy would be feasible (as suggested in

Koch et al., 1977). However, it should be pointed out that

the marginal frequencies for NYC are less than the 5 usually

considered a reasonable requirement in order to assume that

the proportion vector p has an approximate multivariate

normal distribution. The other 33 marginal frequencies are

all greater than 5. A consistent estimator of the

covariance matrix of F is given by VF=AVpA'. VF has a

block diagonal structure which takes into account the

correlation among the functions produced from each

subpopulation.
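For a single subpopulation, the marginal estimates and their covariance block can be computed as sketched below. This is an illustration under stated assumptions rather than the computation actually used: the multinomial covariance of p is taken to be (diag(p) - pp')/n, the Charlotte male row of Table 5.1.1 is used as input, and the helper names `matmul` and `transpose` are hypothetical.

```python
# One 3 x 8 block of A: rows pick out the profiles that include 1973, 1974, 1975.
A1 = [
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]

counts = [185, 3, 0, 1, 3, 1, 5, 1]  # Charlotte males, Table 5.1.1
n = sum(counts)
p = [c / n for c in counts]

# Assumed multinomial covariance of the sample proportions: (diag(p) - p p') / n.
Vp = [
    [(p[j] * (1 - p[j]) if j == k else -p[j] * p[k]) / n for k in range(8)]
    for j in range(8)
]


def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def transpose(X):
    return [list(col) for col in zip(*X)]


F = [sum(a * pj for a, pj in zip(row, p)) for row in A1]  # marginal proportions
VF = matmul(matmul(A1, Vp), transpose(A1))                # 3 x 3 covariance block
print([round(f, 4) for f in F])
```

The off-diagonal entries of this block are nonzero because the same children contribute to all three yearly marginals; stacking twelve such blocks along the diagonal gives the block diagonal structure described above.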

A useful procedure for investigating the variation


among these estimates is to use the cell mean model to gain

a preliminary assessment. This model is formally stated as

\[ E\{F(p)\} = A\pi = X\beta = I\beta = \beta \]

The resulting parameter estimates b of β are thus the

function estimates, and the corresponding estimated

covariance matrix is Vb=VF. Since X has rank t=36, there is no

reduction in dimension and thus no lack of fit defined.

However, Wald statistics can be employed to investigate

linear hypotheses concerning β. Model-fitting efforts can

then continue with the model structures implied by the

results of such hypothesis-testing. The linear hypotheses

of interest and their corresponding Wald statistics

\[ Q_C = b'C'(C V_b C')^{-1} C b \]

are displayed in Table 5.1.2.

Table 5.1.2

Hypotheses and Resulting Test Statistics Concerning Proportions of 1+ Asthma Reported in 1973, 1974, and 1975

Hypothesis                                          Qc     D.F.   P-Value

1. No difference between sexes for                 1.97      1     .160
   averages over (area x time)

2. No variation among areas for averages          15.89      5     .007
   over (sex x time)

3. No variation among times for averages          14.02      2     .000
   over (sex x area)

4. Homogeneity across areas of differences        17.27      5     .004
   between sexes for averages across time

5. Homogeneity across times of differences         4.26      2     .119
   between sexes for averages across area

6. Homogeneity across areas for differences        8.57     10     .573
   across time for averages across sex

7. No time x sex x area interaction                6.78     10     .746

Clearly, the three-way interaction is non-significant

(α=.05), and so is the (area x time) interaction. However,

the (sex x area) interaction is significant, and the (sex x


time) interaction is suggestive. A model which incorporates

these results is X2, which includes separate intercepts for

each of the six areas, effects for sex within each area,

and separate time effects for each sex. Formally, this

model is stated as

\[ E\{F(p)\} = F(\pi) = X_2 \beta_2 \]

where X2 is the (36 x 16) design matrix displayed in Table

5.1.3a. Table 5.1.3b contains the parameter estimates b2

and their standard errors. The Wald goodness-of-fit

statistic for the model X2 is

\[ Q_W = (F - X_2 b_2)' V_F^{-1} (F - X_2 b_2) = 18.42 \]

(d.f.=20), and its non-significance (p=.56) supports the

model. Table 5.1.3b also contains Wald statistics

corresponding to linear hypotheses concerning β2. β1 is the

predicted reference value for males in Charlotte in 1973.

β2-β6 are the corresponding values for Birmingham, NYC,

Utah, California I and California II. β7 represents an

incremental effect for females in Charlotte, and β8-β12 are

similar effects for the other areas. β13 is the incremental

effect for the year 1974 for males, β14 is the

corresponding effect for 1975, and β15 and β16 are the time

effects for the years 1974 and 1975 for the females,

respectively.

Thus, the significant (sex x area) interaction of the

previous preliminary modeling stage is accounted for in the

model X2 with separate sex effects for each area. The test

for no (sex x time) interaction in the model X2

(β13=β15, β14=β16) resulted in a Wald statistic Qc=8.69

128

Table 5.1. 3a

Specification Matrix for Reduced Model Xz for 1+ Asthma

Specification Matrix X2

1 0 0 0 0 0 0 0 0 0 0 0 o 0 0 01 0 0 0 0 0 0 0 0 0 0 0 1 0 0 01 0 0 0 0 0 0 0 0 0 0 0 o 1 0 00 1 0 0 0 0 0 0 0 0 0 0 o 0 0 00 1 0 0 0 0 0 0 0 0 0 0 1 0 0 00 1 0 0 0 0 0 0 0 0 0 0 o 1 0 00 0 1 0 0 0 0 0 0 0 0 0 o 0 0 00 0 1 0 0 0 0 0 0 0 0 0 1 0 0 00 0 1 0 0 0 0 0 0 0 0 0 o 1 0 00 0 0 1 0 0 0 0 0 0 0 0 o 0 0 00 0 0 1 0 0 0 0 0 0 0 0 1 0 0 00 0 0 1 0 0 0 0 0 0 0 0 o 1 0 00 0 0 0 1 0 0 0 0 0 0 0 o 0 0 00 0 0 0 1 0 0 0 0 0 0 0 1 0 0 00 0 0 0 1 0 0 0 0 0 0 0 o 1 0 00 0 0 0 0 1 0 0 0 0 0 0 o 0 0 00 0 0 0 0 1 0 0 0 0 0 0 1 0 0 00 0 0 0 0 1 0 0 0 0 0 0 o 1 0 01 0 0 0 0 0 1 0 0 0 0 0 o 0 0 01 0 0 0 0 0 1 0 0 0 0 0 o 0 1 0 e1 0 0 0 0 0 1 0 0 0 0 0 o 0 0 10 1 0 0 0 0 0 1 0 0 0 0 o 0 0 00 1 0 0 0 0 0 1 0 0 0 0 o 0 1 00 1 0 0 0 0 0 1 0 0 0 0 o 0 0 10 0 1 0 0 0 0 0 1 0 0 0 o 0 0 00 0 1 0 0 0 0 0 1 0 0 0 o 0 1 00 0 1 0 0 0 0 0 1 0 0 0 o 0 0 10 0 0 1 0 0 0 0 0 1 0 0 o 0 0 00 0 0 1 0 0 0 0 0 1 0 0 o 0 1 00 0 0 1 0 0 0 0 0 1 0 0 o 0 0 10 0 0 0 1 0 0 0 0 0 1 0 o 0 0 00 0 0 0 1 0 0 0 0 0 1 0 o 0 1 00 0 0 0 1 0 0 0 0 0 1 0 o 0 0 10 0 0 0 0 1 0 0 0 0 0 1 o 0 0 00 0 0 0 0 1 0 0 0 0 0 1 o 0 1 00 0 0 0 0 1 0 0 0 0 0 1 o 0 0 1

129

Table 5.1.3b

Estimated Parameters, Standard Errors and Test Statistics for Reduced Model X2 for 1+ Asthma

Parameter  Interpretation                                          Estimate and Standard Error
β1:   Predicted value for 1973 males for Charlotte                  .0304 ± .0102
β2:   Predicted value for 1973 males for Birmingham                 .0506 ± .0090
β3:   Predicted value for 1973 males for NYC                       -.0203 ± .0072
β4:   Predicted value for 1973 males for Utah                       .0334 ± .0085
β5:   Predicted value for 1973 males for California I               .0769 ± .0109
β6:   Predicted value for 1973 males for California II              .0722 ± .0140
β7:   Incremental value for 1973 females for Charlotte              .0191 ± .0160
β8:   Incremental value for 1973 females for Birmingham            -.0106 ± .0112
β9:   Incremental value for 1973 females for NYC                    .0341 ± .0196
β10:  Incremental value for 1973 females for Utah                   .0233 ± .0136
β11:  Incremental value for 1973 females for California I          -.0247 ± .0142
β12:  Incremental value for 1973 females for California II         -.0386 ± .0168
β13:  Incremental value for 1974 males                              .0036 ± .0046
β14:  Incremental value for 1975 males                              .0216 ± .0054
β15:  Incremental value for 1974 females                            .0058 ± .0039
β16:  Incremental value for 1975 females                            .0018 ± .0043

Wald Test Statistics for Hypotheses                                  Qc     d.f.
No area variation for sex increments                               17.19      5
   (β7 = β8 = β9 = β10 = β11 = β12)
No time effect for males for 1974 (β13=0)                            .62      1
No time effect for males for 1975 (β14=0)                          16.33      1
No time effect for females for 1974 (β15=0)                         2.17      1
No time effect for females for 1975 (β16=0)                          .18      1
No (sex x time) interaction (β13=β15, β14=β16)                      8.69      2

Goodness of Fit Qw = 18.42 d.f. = 20

with d.f.=2; this is significant (α=.05). Hypotheses

concerning the individual time effects were then tested;

the effect for males in 1975 proved to be the only

significant one (α=.05). The hypothesis of equality of (area

x sex) effects (β7=β8=β9=β10=β11=β12) was also assessed,

and proved to be contradicted (Qc=17.19, with 5 d.f.).
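The Wald statistic Qc = b'C'(CVbC')^{-1}Cb used for these tests can be sketched as follows. Because the covariances among the parameter estimates are not tabulated, the example makes the loud simplifying assumption that Vb is diagonal, built from the standard errors of β13 through β16 in Table 5.1.3b; the results therefore only approximate the reported values (for example about 16.0 rather than 16.33 for β14, and about 8.36 rather than 8.69 for the sex x time contrast). The function names `solve` and `wald` are illustrative.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def wald(b, C, V):
    """Qc = (Cb)' (C V C')^{-1} (Cb) for a contrast matrix C."""
    k = len(b)
    Cb = [sum(c * x for c, x in zip(row, b)) for row in C]
    CV = [[sum(row[i] * V[i][j] for i in range(k)) for j in range(k)] for row in C]
    CVC = [[sum(cv[i] * row2[i] for i in range(k)) for row2 in C] for cv in CV]
    z = solve(CVC, Cb)
    return sum(x * y for x, y in zip(Cb, z))


# Estimates and standard errors of beta13..beta16 from Table 5.1.3b.
b = [0.0036, 0.0216, 0.0058, 0.0018]
se = [0.0046, 0.0054, 0.0039, 0.0043]
# ASSUMPTION: off-diagonal covariances are unknown and set to zero here.
V = [[se[i] ** 2 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Single-parameter tests (beta_k = 0) reduce to (estimate / s.e.)^2.
q_single = [wald(b, [[1 if j == i else 0 for j in range(4)]], V) for i in range(4)]

# Sex x time contrast: beta13 = beta15 and beta14 = beta16 (2 d.f.).
C = [[1, 0, -1, 0], [0, 1, 0, -1]]
q_joint = wald(b, C, V)
print([round(q, 2) for q in q_single], round(q_joint, 2))
```

With the full estimated covariance matrix in place of the diagonal approximation, the same function would return the statistics reported in the table.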

In view of the significant (sex x area) and (sex x

time) interactions, it is of interest to describe their

nature. This will be done with subsequent models. These

models should be viewed as descriptive tools, inferentially

justified by the model X2 . P-values resulting from tests on

model parameters are used as guidelines for further

smoothing rather than as outcomes in formal hypothesis

testing. Accordingly, the model X3 was fitted, which

included a reduction of the 16 parameters to 13, excluding

the 1974 time increments and the 1975 increment for females.

The goodness-of-fit criterion for X3 is Qw = 23.82

(d.f.=23,p=.41), which is clearly supportive of the model.

Table 5.1.4 displays the parameter estimates for this model

as well as the resulting Wald statistics for linear

hypotheses concerning them. X3 has the same structure as X2

except that the columns corresponding to β13, β15, and β16

have been removed. The Wald statistics suggest that a final

model can further reduce dimensionality. The reference

parameters for the California areas are not significantly

different from each other, and neither are the sex effects

for NYC and Utah nor the sex effects for the California

Table 5.1.4

Estimated Parameters and Test Statistics for the Reduced Model X3 for 1+ Asthma

Parameter  Interpretation                                                  Estimate and Standard Error
β1:   Predicted value for males in 1973 & 1974 in Charlotte                 .0320 ± .0100
β2:   Predicted value for males in 1973 & 1974 in Birmingham                .0521 ± .0086
β3:   Predicted value for males in 1973 & 1974 in NYC                       .0001 ± .0066
β4:   Predicted value for males in 1973 & 1974 in Utah                      .0348 ± .0083
β5:   Predicted value for males in 1973 & 1974 in California I              .0781 ± .0106
β6:   Predicted value for males in 1973 & 1974 in California II             .0735 ± .0139
β7:   Increment for females in 1973 & 1974 in Charlotte                     .0155 ± .0156
β8:   Increment for females in 1973 & 1974 in Birmingham                   -.0150 ± .0101
β9:   Increment for females in 1973 & 1974 in NYC                           .0266 ± .0191
β10:  Increment for females in 1973 & 1974 in Utah                          .0200 ± .0133
β11:  Increment for females in 1973 & 1974 in California I                 -.0281 ± .0131
β12:  Increment for females in 1973 & 1974 in California II                -.0395 ± .0166
β13:  Increment for males for 1975                                          .0192 ± .0044

Wald Test Statistics for Simplifications

Equality of reference parameters for California I and California II, equality of sex effects for NYC and Utah, and equality of sex effects for California I and California II
(β5 = β6, β9 = β10, β11 = β12)   Qc = 1.91   d.f. = 3

C1 =
 0 0 0 0 1-1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1-1 0 0 0
 0 0 0 0 0 0 0 0 0 0 1-1 0

Null sex effects for Charlotte, Birmingham, NYC and Utah
(β7 = β8 = β9 = β10 = 0)   Qc = 7.07   d.f. = 4

C2 =
 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0

areas (α=.05). The tests supporting these conclusions were

simultaneously assessed with contrast matrix C1 shown in

Table 5.1.4. An additional hypothesis investigated was

whether the sex effects for Charlotte, Birmingham, NYC and

Utah were simultaneously null. The Qc for this test was

7.07 (d.f.=4), and thus these sources of variation can be

eliminated.

A final model X4 was then fitted which took into

account the results of the statistical tests concerning the

parameters of intermediate model X3 . The parameters for X4

are seven in number, and include predicted reference

parameters for subjects in Charlotte, Birmingham, NYC and

Utah in 1973 and 1974, and another for males in California

in 1973 and 1974. There are two additional parameters. One

is an incremental effect for females in California and the

other is an overall incremental effect for 1975 for males.

The goodness-of-fit statistic for X4 is 32.71 with 29 d.f.

Its non-significance (α=.25) supports model X4 as adequately

describing the functions F(p).
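The degrees of freedom and p-values quoted for the successive asthma models can be checked with a short calculation. The sketch below is an illustration only: the d.f. is taken as the number of functions (36) minus the number of parameters, the Qw values are those reported for X2, X3, and X4 (the X4 value as given in Table 5.1.5a), and `chi2_sf` is a hypothetical helper that evaluates the chi-square upper tail with a standard series for the incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate (series for the lower tail)."""
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - lower


N_FUNCTIONS = 36  # 12 subpopulations x 3 yearly marginals
models = {"X2": (16, 18.42), "X3": (13, 23.82), "X4": (7, 32.71)}

results = {}
for name, (n_params, qw) in models.items():
    df = N_FUNCTIONS - n_params
    results[name] = (df, chi2_sf(qw, df))
    print(f"{name}: Qw = {qw:5.2f}, d.f. = {df}, p = {results[name][1]:.4f}")
```

Each reduction trades a modest increase in Qw for additional residual degrees of freedom, which is why the fit remains acceptable as parameters are removed.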

Thus, the variation among the subpopulation estimates

for 1+ asthma is mostly attributable to differences in

areas. The (sex x area) interaction surfacing in the

results of Table 5.1.2 is isolated to California with this

model. The (sex x time) interaction is accounted for with

an effect for 1975 which is limited to males. Thus, the

effect of time in the estimates for 1+ asthma is very

limited. The estimates for β4, the design matrix X4, and

Table 5.1.5a

Estimated Parameters and Standard Errors for Final Model X4 for 1+ Asthma

Specification Matrix X4

1 0 0 0 0 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0 1
0 1 0 0 0 0 0
0 1 0 0 0 0 0
0 1 0 0 0 0 1
0 0 1 0 0 0 0
0 0 1 0 0 0 0
0 0 1 0 0 0 1
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 1
0 0 0 0 1 0 0
0 0 0 0 1 0 0
0 0 0 0 1 0 1
0 0 0 0 1 0 0
0 0 0 0 1 0 0
0 0 0 0 1 0 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
1 0 0 0 0 0 0
0 1 0 0 0 0 0
0 1 0 0 0 0 0
0 1 0 0 0 0 0
0 0 1 0 0 0 0
0 0 1 0 0 0 0
0 0 1 0 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0
0 0 0 0 1 1 0

Parameter  Interpretation                          Estimate and Standard Error
β1:  Predicted value for Charlotte                   .0384 ± .0077
β2:  Predicted value for Birmingham                  .0430 ± .0051
β3:  Predicted value for NYC                         .0034 ± .0062
β4:  Predicted value for Utah                        .0426 ± .0065
β5:  Predicted value for California                  .0769 ± .0085
β6:  Increment for females for California           -.0343 ± .0105
β7:  Increment for 1975 for males                    .0182 ± .0043

Goodness of fit: Qw = 32.71 d.f.=29 (p-value=.2894)

Table 5.1.5b

Observed and Predicted Estimates of First Order Marginal Probabilities for 1+ Asthma

the values of the original estimates for 1+ asthma as well

as their predicted values X4b4 are displayed in Tables

5.1.5a and 5.1.5b.
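Predicted marginal probabilities under the final model follow from multiplying each row of X4 by the estimates in Table 5.1.5a. The sketch below rebuilds a row from the parameter interpretations; the helper names `x4_row` and `predict` are illustrative, and the printed values are computed here from the tabulated estimates rather than copied from Table 5.1.5b.

```python
AREAS = ["Charlotte", "Birmingham", "NYC", "Utah", "California I", "California II"]

# Estimates b4 from Table 5.1.5a (beta1..beta7).
b4 = [0.0384, 0.0430, 0.0034, 0.0426, 0.0769, -0.0343, 0.0182]


def x4_row(sex, area, year):
    """Row of the specification matrix X4 for one (sex, area, year) function."""
    row = [0] * 7
    row[min(AREAS.index(area), 4)] = 1  # both California areas share column 5
    if sex == "female" and area.startswith("California"):
        row[5] = 1                      # female increment, California only
    if sex == "male" and year == 1975:
        row[6] = 1                      # 1975 increment, males only
    return row


def predict(sex, area, year):
    """Predicted probability of 1+ asthma: the row of X4 times b4."""
    return sum(x * b for x, b in zip(x4_row(sex, area, year), b4))


rows = [x4_row(s, a, y) for s in ("male", "female") for a in AREAS
        for y in (1973, 1974, 1975)]
print(round(predict("male", "Charlotte", 1975), 4))
print(round(predict("female", "California II", 1974), 4))
```

Under this model the predicted value for California females (.0769 - .0343) is close to the Utah and Birmingham reference values, which reflects how the (sex x area) interaction is confined to California.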

5.2 Linear Model Analysis of 2+ Colds for 1973, 1974,

and 1975 for the Sex x Race x Area Cross-Classification

Variables for sex, area, and race (SEX, AREA, RACE)

were selected for inclusion in the modeling for 2+ colds in

1973, 1974, and 1975 in the variable selection process

described in Chapter IV. Table 5.2.1 displays the results

of cross-classifying the subjects by sex, race, and area

according to their response profiles. The table also

contains the marginal frequencies for having 2+ colds in

1973, 1974, and 1975. The data are distributed somewhat

more uniformly across the response profiles than they were

for 1+ asthma; however, the 'None' category still dominates

the distribution. Also, the addition of race to the

cross-classification means that the number of subpopulations under

investigation doubles to 24 from the 12 studied in section

5.1. It is doubtful whether the inclusion of still another

cross-classification variable could have been addressed

within this analysis framework.

Similar to the previous analysis, the functions being

modeled are the first order marginal probabilities of

reporting colds in 1973, 1974, and 1975 for each

subpopulation corresponding to all possible combinations of

sex, race and area. The marginal frequencies which are the

numerators of such functions are displayed in Table 5.2.1

Table 5.2.1

Frequency of 2+ Colds Reported in 1973, 1974, and 1975

                      Response profile (years reported)                      Marginals
Area            None   73   74  73,74   75  73,75  74,75  73-75   Total    73   74   75

White Males
Charlotte         71   15   19     9    10     2      7      3     136     29   38   22
Birmingham       189   14   35    15    22     8     14      6     303     43   70   50
NYC               47    4    1     2     5     2      2      0      63      8    5    9
Utah             171   25   37     7    50     9     12      9     320     50   65   80
California I     298   31   39     6    35    13     19     10     451     60   74   77
California II    150   23   21     9    27     5     13     12     260     49   55   57

Other Males
Charlotte         41    7    8     4     2     0      1      0      63     11   13    3
Birmingham       161    6   17     5    11     3      7      3     213     17   32   24
NYC               13    0    1     0     1     0      0      0      15      0    1    1
Utah              22    1    2     0     3     0      1      0      29      1    3    4
California I      48    5    4     3     6     1      1      3      71     12   11   11
California II     22    1    0     0     1     2      0      2      28      5    2    5

White Females
Charlotte         73   21   19    17    14     7     16     14     181     59   66   51
Birmingham       171   29   45    26    41    18     24     23     377     96  118  106
NYC               38    5    7     1     2     2      3      4      62     12   15   11
Utah             145   35   26    14    50    27     25     28     350    104   93  130
California I     219   34   36    17    38    15     13     21     393     87   87   87
California II    113   29   25    14    25    11     21     22     260     76   82   79

Other Females
Charlotte         25    4    7    12     7     4      3      5      67     25   27   19
Birmingham       108   13   29    15    19     7      9     12     212     47   65   47
NYC               10    1    1     0     2     1      0      1      16      3    2    4
Utah              21    3    3     0     4     4      0      1      36      8    4    9
California I      39    7    5     2     9     0      0      2      64     11    9   11
California II     18    4    4     1     4     0      0      1      32      6    6    5

as well. There is again a problem with the entries for NYC,

as the frequency for 'Other' males for 1973 is 0. A

slightly different adjustment was applied in this analysis

than that of section 5.1, where .5 was added to one randomly

selected cell to produce a non-zero function value for the

NYC males subpopulation. Instead, each of the entries for

that particular subpopulation was increased by .5, and then

further adjusted by a multiplicative factor

\[ a = \frac{\sum_{j=1}^{8} y_{123j}}{\sum_{j=1}^{8} (y_{123j} + .5)} . \]

{y_ghij} represents the set of table entries; g=1,2 for

sex, h=1,2 for race, i=1,2,...,6 for area, and j=1,2,...,8 for

response profile. Thus, all of the row's entries are

augmented by a small amount so that zero frequencies are

eliminated, but the marginal total is constrained to be the

original total. In this case, the adjustment factor is

a=15/19, and the numbers in parentheses in Table 5.2.1 are

the adjusted counts.
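The adjustment can be sketched in a few lines, using the 'Other' male row for NYC from Table 5.2.1. The function name `adjust_row` is illustrative; the printed adjusted counts are computed here rather than copied from the table.

```python
def adjust_row(counts, add=0.5):
    """Add a constant to every cell, then rescale to the original row total."""
    total = sum(counts)
    bumped = [c + add for c in counts]
    factor = total / sum(bumped)
    return factor, [factor * c for c in bumped]


nyc_other_males = [13, 0, 1, 0, 1, 0, 0, 0]  # eight response profiles
factor, adjusted = adjust_row(nyc_other_males)

# Profiles that include 1973 are columns 2, 4, 6, 8 (indices 1, 3, 5, 7).
marginal_1973 = sum(adjusted[j] for j in (1, 3, 5, 7))
print(round(factor, 4), [round(x, 2) for x in adjusted], round(marginal_1973, 3))
```

Because every cell is increased before rescaling, the 1973 marginal becomes positive while the subpopulation total of 15 is preserved.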

The data are assumed to have the product multinomial

distribution, so that

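\[ \Pr\{y\} = \prod_{g=1}^{2} \prod_{h=1}^{2} \prod_{i=1}^{6} n_{ghi.}! \prod_{j=1}^{8} \frac{\pi_{ghij}^{y_{ghij}}}{y_{ghij}!} \]

where π_ghij represents the probability that a randomly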

selected subject of the g-th sex, h-th race, and i-th area


has the j-th profile. The function vector whose elements

are the marginal probabilities of having 2+ colds in 1973,

1974, and 1975 is written

\[ F(p) = Ap = \left( \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix} \otimes I_{24} \right) p \]

where p = (y_ghij/n_ghi.) is a (192 x 1) vector of sample

proportions. A consistent estimator for the covariance

matrix of F is VF=AVpA', where Vp is defined as in (3.2.5).

The identity model again was used to assess potential

sources of variation:

\[ E\{F(p)\} = A\pi = I\beta = \beta \]

and Table 5.2.2 contains the results of tests for those

linear hypotheses concerning the estimates of 2+ colds for

each of the subpopulations for each of the three years.

These results indicate that there are no significant

three-way interactions (α=.05) in the data, and also that

the four-way interaction is non-significant (p=.82). The

two-way interactions which are significant are (sex x area)

(Qc=14.49, d.f.=5) and (time x area) (Qc=42.07, d.f.=10).

Sex, area, and race were all very significant (all p-values

<.001), while the average time effect was not (p=.4241).

Accordingly, a useful reduced model for F(p) is

\[ E\{F(p)\} = A\pi = X_2 \beta_2 \]

This model incorporates the above findings by including

Table 5.2.2

Hypotheses and Resulting Test Statistics Concerning Proportions of 2+ Colds Reported in 1973, 1974, and 1975

Hypothesis                                            Qc     d.f.   P-Value

 1. No difference between sexes for averages        44.15      1     .000
    over area x time x race

 2. No difference between races for averages        13.64      1     .000
    over area x time x sex

 3. No variation among areas for averages           23.48      5     .000
    over time x sex x race

 4. No variation among times for averages            1.72      2     .424
    over sex x area x race

 5. Homogeneity across race of differences            .05      1     .824
    between sexes for averages across
    time x area (i.e. race x sex)

 6. Homogeneity across areas of differences         14.49      5     .010
    between sexes for averages across
    race x time

 7. Homogeneity across time of differences            .96      2     .620
    between sexes for averages across
    race x area

 8. Homogeneity across areas of differences          8.86      5     .115
    between races for averages across
    sex x time

 9. Homogeneity across time for differences          1.46      2     .483
    between races for averages across
    sex x area

10. Homogeneity across time for differences         42.06     10     .000
    among areas for averages across
    sex x race

11. No average sex x race x area interaction         6.71      5     .243

12. No average sex x race x time interaction          .21      2     .814

13. No average sex x area x time interaction         6.63     10     .760

14. No average race x area x time interaction        3.41     10     .970

15. No sex x race x area x time interaction          5.96     10     .818

separate intercepts for the six areas, and separate effects

for sex, race, and time within each of the areas. The 30

parameter model X2 is shown in Table 5.2.3, along with

parameter estimates and interpretations. The

goodness-of-fit is supported by the non-significance of the Wald

statistic (d.f.=42). The parameter interpretations are as

follows:

β1-β6: reference values for white males in 1973 for

Charlotte, Birmingham, NYC, Utah, California I,

and California II, respectively

β7-β12: increments for females for each area

β13-β18: increments for 'Other' race for each area

β19-β20: increments for 1974 and 1975, for Charlotte

β21-β22: increments for 1974 and 1975, for Birmingham

β23-β24: increments for 1974 and 1975, for NYC

β25-β26: increments for 1974 and 1975, for Utah

β27-β28: increments for 1974 and 1975, for California I

β29-β30: increments for 1974 and 1975, for California II
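The parameter layout above determines the 72 x 30 specification matrix. A minimal sketch of its construction follows; the row ordering (sex, then race, then area, then year) is an assumption chosen to match the layout of Table 5.2.1, and the function name `x2_row` is hypothetical.

```python
from itertools import product

YEARS = (1973, 1974, 1975)


def x2_row(sex, race, area, year):
    """Row of the 30-parameter model; `area` is an index 0..5 (Charlotte..California II)."""
    row = [0] * 30
    row[area] = 1                    # beta1-beta6: area reference values
    if sex == "female":
        row[6 + area] = 1            # beta7-beta12: female increment within area
    if race == "other":
        row[12 + area] = 1           # beta13-beta18: 'Other' race increment within area
    if year == 1974:
        row[18 + 2 * area] = 1       # beta19, 21, ..., 29: 1974 increment within area
    elif year == 1975:
        row[19 + 2 * area] = 1       # beta20, 22, ..., 30: 1975 increment within area
    return row


# ASSUMED row order: sex slowest, then race, then area, with year fastest.
X2 = [
    x2_row(sex, race, area, year)
    for sex, race, area, year in product(
        ("male", "female"), ("white", "other"), range(6), YEARS
    )
]
residual_df = len(X2) - len(X2[0])
print(len(X2), len(X2[0]), residual_df)
```

The 72 functions less the 30 parameters leave the 42 degrees of freedom quoted for the goodness-of-fit test.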

Hypotheses concerning interactions between sex and

area, race and area, and time and area were evaluated with

Wald statistics of the form Qc=b'C'(CVbC')^{-1}Cb. Table

5.2.4a displays the hypotheses couched in terms of the

respective parameters as well as the appropriate C matrices.

All tests proved significant (α=.05). It is again of

interest to continue modeling efforts in an attempt to

describe more finely the nature of such interactions, using

the results for model X2 as the inferential justification.

Table 5.2.3

Specification Matrix, Estimated Parameters, and Standard Errors for Preliminary Model X2 for 2+ Colds

Specification matrix X2 (72 x 30): each row corresponds to one (sex x race x area x year) function and contains a 1 in the intercept column for its area, in that area's sex column for females, in that area's race column for 'Other' race, and in that area's 1974 or 1975 column for those years; all other entries are 0.

Parameter     b      s.e.(b)
β1          .209     .0254
β2          .139     .0130
β3          .120     .0331
β4          .173     .0106
β5          .149     .0136
β6          .199     .0204
β7          .148     .0263
β8          .117     .0165
β9          .080     .0418
β10         .106     .0202
β11         .057     .0172
β12         .097     .0245
β13        -.054     .0258
β14        -.052     .0163
β15        -.001     .0520
β16        -.127     .0246
β17        -.029     .0235
β18        -.126     .0346
β19         .044     .0261
β20        -.073     .0263
β21         .075     .0144
β22         .024     .0142
β23        -.017     .0335
β24         .003     .0340
β25         .008     .0191
β26         .084     .0199
β27         .010     .0151
β28         .018     .0152
β29         .006     .0206
β30         .015     .0213

These models are intended to provide a mechanism with which

one can describe the variation in the data. Efforts first

concentrate on smoothing parameterizations for components

other than time. Consideration of the estimated parameters

and their respective standard errors suggests that the following

reductions be investigated:

C4 : There is equality of sex effects for Charlotte, Birmingham, NYC, Utah, and California II.

C5 : There is equality of race effects for Birmingham and Charlotte, and for Utah and California II, respectively, and null race effects for NYC and California I.

Since the model for 1+ asthma discussed in section 5.1

included common parameters for the two California areas,

another model simplification evaluated was whether the area

and sex parameters were equivalent in model X2 for 2+ colds:

C6 : There is equality of reference, sex, and race parameters for California I and California II.

The contrast matrices which correspond to the above

reductions are presented in Table 5.2.4b, along with the

resulting Wald statistics. These contrasts might be

considered 'compound' in that multiple sets of contrasts are

tested simultaneously. For example, C5 tests whether two

sets of two parameters can each be replaced by one, and at

the same time whether two other effects are null. This is a

convenient strategy when the number of functions being

modeled is large and subsequently the number of parameters

involved in any preliminary investigative model such as X2

is also going to be considerable. Reduction-implying

Table 5.2.4a

Results of Linear Hypotheses Concerning the Parameter Estimates for the Model X2 for 2+ Colds

C1 =
 0 0 0 0 0 0 1 0 0 0 0-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 0-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1-1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

H1 : β7 = β8 = β9 = β10 = β11 = β12
Qc = 11.15   p-value = .0484   d.f. = 5

C2 =
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0-1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0-1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0-1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0-1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1-1 0 0 0 0 0 0 0 0 0 0 0 0

H2 : β13 = β14 = β15 = β16 = β17 = β18
Qc = 14.04   p-value = .0154   d.f. = 5

C3 =
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0-1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0-1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0-1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0-1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0-1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0-1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0-1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0-1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0-1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0-1

H3 : β19 = β21 = β23 = β25 = β27 = β29, β20 = β22 = β24 = β26 = β28 = β30
Qc = 56.32   p-value = .0001   d.f. = 10

Table 5.2.4b

Results of Tests for Simplifications Concerning Parameter Estimates for the Model X2 for 2+ Colds

C4 =
|  0  0  0  0  0  0    1  0  0  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  1  0  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  0  1  0  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  0  0  1  0 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0 |

$H_4: \beta_7 = \beta_8 = \beta_9 = \beta_{10} = \beta_{12}$
Qc = 3.07    p-value = .5465    d.f. = 4

C5 =
|  0  0  0  0  0  0    0  0  0  0  0  0    1 -1  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  1  0 -1    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  0  0  0  0  0    0  0  1  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  1  0    0  0  0  0  0  0    0  0  0  0  0  0 |

$H_5: \beta_{13} = \beta_{14}, \quad \beta_{16} = \beta_{18}, \quad \beta_{15} = 0, \quad \beta_{17} = 0$
Qc = 1.54    p-value = .8187    d.f. = 4

C6 =
|  0  0  0  0  1 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  0  0  0  1 -1    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  1 -1    0  0  0  0  0  0    0  0  0  0  0  0 |
|  0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  1  0 -1  0 |
|  0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  0  0  0    0  0  0  1  0 -1 |

$H_6: \beta_5 = \beta_6, \quad \beta_{11} = \beta_{12}, \quad \beta_{17} = \beta_{18}, \quad \beta_{27} = \beta_{29}, \quad \beta_{28} = \beta_{30}$
Qc = 18.58    p-value = .0023    d.f. = 5


simplifications can thus be grouped together according to

the effects their corresponding parameters are

characterizing, such as sex, race, and time. If the

simplification is contradicted, then the individual

components can be tested separately. This approach is also

sensible when one considers that the reduced model the

simultaneous reductions imply has a goodness-of-fit chi

square equivalent to that of the original model plus the

quantity equivalent to the Wald statistic for the test of

the simultaneous reduction. The test for equality of the

California parameters is clearly significant ($\alpha$=.05),

indicating that the California areas cannot be treated the

same as they were in section 5.1. Reductions C4 and C5 are

not contradicted and suggest that certain reductions in the

model structure are plausible.
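
The contrast matrices in Tables 5.2.4a and 5.2.4b all follow one pattern: each row sets one parameter against a common comparison parameter. The sketch below shows one way such rows could be generated; the helper name `equality_contrasts` is hypothetical and is not part of the original analysis.

```python
def equality_contrasts(n_params, columns):
    """Build rows e_j - e_last for an equality hypothesis.

    `columns` lists the 1-based parameter numbers assumed equal; every
    parameter except the last is contrasted with the last one.
    """
    last = columns[-1]
    rows = []
    for j in columns[:-1]:
        row = [0] * n_params
        row[j - 1] = 1       # parameter being compared
        row[last - 1] = -1   # common comparison parameter
        rows.append(row)
    return rows


# C1: beta7 = beta8 = ... = beta12 among the 30 parameters of X2 (5 d.f.)
C1 = equality_contrasts(30, [7, 8, 9, 10, 11, 12])

# C4: beta7 = beta8 = beta9 = beta10 = beta12, leaving beta11 free (4 d.f.)
C4 = equality_contrasts(30, [7, 8, 9, 10, 12])
```

The number of rows equals the degrees of freedom of the corresponding Wald test, so C1 has five rows and C4 has four.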

The next step in the model-fitting for 2+ colds was to

incorporate the results of the tests for C4 and C5 into a

reduced model X3 for the 2+ colds estimates. This model is

expressed

$$E\{F(p)\} = F(\pi) = X_3\beta_3$$

where X3 is of rank 22 and includes six reference parameters

(one for each area), twelve time parameters as in X2 , one

sex effect for California I, and one other sex effect for

the other five areas. Also in the model is one race effect

for both Birmingham and Charlotte, and one other for Utah

and California II. Columns 7,8,9,10 and 12 of X2 have been

added together, as have columns 13 and 14, and columns 16


and 18. Columns 15 and 17 have been deleted. Since NYC and

California do not have race effects in this model, their

respective reference values apply to both white and 'Other'

races. The goodness-of-fit for the model X3 is QW=35.22

(d.f.=50, p=.94). Note that the difference between the

goodness-of-fit for X3 and X2 is 35.22 - 30.52 = 4.70, and

the difference in degrees of freedom is 30 - 22 = 8. The

value 4.70 is the value of the Wald statistic for the 8

degree of freedom test for the simultaneous testing of

reductions C4 and C5 . The parameter estimates and their

standard errors are displayed in Table 5.2.5.
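
The comparison of nested models above reduces to a chi-square tail probability for the difference in fit statistics. The following is a minimal sketch of that calculation in plain Python; the function `chi2_sf` is an illustrative helper (not from the original analysis) that uses the standard closed-form series for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with `df` d.f. > x), integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) * sum_{j=0}^{df/2-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd d.f.: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_r x^(r-1/2) / (1*3*...*(2r-1))
    phi = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + 2.0 * phi * total


# Fit statistics quoted in the text for X3 (22 parameters) and X2 (30 parameters)
q_x3, df_x3 = 35.22, 50
q_x2, df_x2 = 30.52, 42

q_diff = q_x3 - q_x2      # Wald statistic for the joint reduction C4 and C5
df_diff = df_x3 - df_x2   # 8 degrees of freedom
p_diff = chi2_sf(q_diff, df_diff)
p_fit_x3 = chi2_sf(q_x3, df_x3)
```

A large `p_diff` indicates that the joint reduction is not contradicted, which is the reading given in the text.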

Table 5.2.5

Estimated Parameters and Standard Errors for Intermediate Model X3 for 2+ Colds
(standard errors in parentheses)

Reference parameters $\beta_1$-$\beta_6$
     .224     .142     .107     .170     .146     .191
    (.214)   (.0133)  (.0284)  (.0149)  (.0139)  (.0177)

Sex parameters $\beta_7$, $\beta_8$
     .113     .054
    (.0101)  (.0171)

Race parameters $\beta_9$, $\beta_{10}$
    -.0552   -.1269
    (.0137)  (.0199)

Time parameters $\beta_{11}$-$\beta_{22}$
     .045    -.076     .075     .024    -.015     .003
    (.0261)  (.0260)  (.0144)  (.0141)  (.3350)  (.0340)

     .007     .083     .010     .017     .007     .015
    (.0180)  (.0199)  (.0151)  (.0151)  (.0205)  (.0212)

At this point, smoothing has been accomplished for

those effects which represent among subject variation.

Time, on the other hand, is an effect which represents

within subject variation. Such variation is usually smaller

in magnitude than the among subject variation.

Consideration of the parameters reflecting the variability


of time for the 2+ colds estimates leads to the following

simplification:

All time effects are null except for the increment for Charlotte for 1975, the increment for Birmingham in 1974, and the increment for Utah in 1975.

One can proceed as before, and investigate this

hypothesis with a test of the form $H_0: C\beta = 0$, or one can test

this hypothesis via a less direct, although equivalent

mechanism. As discussed in Koch, Imrey, et al. (1985),

testing linear hypotheses concerning the elements of $\beta$ is

equivalent to specifying a linear model for $\beta$, i.e.

$$\beta_3 = Z\tau$$

where Z is a known $(t \times (t-c))$ orthocomplement to C, $\beta_3$ is the

parameter vector for the model X3, and $\tau$ is a $((t-c) \times 1)$

vector of parameters. The goodness-of-fit test for this model is equal to

the Wald statistic Qc for the corresponding $H_0$, and it can

be shown that the goodness-of-fit statistic Qw,c for X4=X3Z

is equal to Qc plus Qw, the Wald goodness-of-fit statistic

for the original model X3. Fitting a model Z to $\beta$ is often

useful when formulating $H_0$ in terms of a contrast matrix is

cumbersome, but representing it as a linear model in $\beta$ is

straightforward. Here,

QC = 8.23 (d.f.=9) with p=.52. Thus, the goodness-of-fit

for the reduced model X4 = X3Z is QWC = 35.2261 + 8.2762 =

43.50, with 50 + 9 = 59 degrees of freedom. The estimated

parameter vector and its standard error vector are as

follows:

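Parameter    Estimate    s.e.
$\tau_1$       .2444     .0177
$\tau_2$       .1528     .0116
$\tau_3$       .1016     .0201
$\tau_4$       .1730     .0122
$\tau_5$       .1535     .0108
$\tau_6$       .1977     .0133
$\tau_7$       .1144     .0100
$\tau_8$       .0557     .0171
$\tau_9$      -.0551     .0137
$\tau_{10}$   -.1270     .0199
$\tau_{11}$   -.1004     .0219
$\tau_{12}$    .0649     .0131
$\tau_{13}$    .0802     .0178

These estimates have the following interpretation:

$\tau_1$: Reference value for Charlotte white males in 1973
$\tau_2$: Reference value for Birmingham white males in 1973
$\tau_3$: Reference value for NYC white males in 1973
$\tau_4$: Reference value for Utah white males in 1973
$\tau_5$: Reference value for California I white males in 1973
$\tau_6$: Reference value for California II white males in 1973
$\tau_7$: sex effect for all areas except California I
$\tau_8$: sex effect for California I
$\tau_9$: race effect for Birmingham and Charlotte
$\tau_{10}$: race effect for Utah and California II
$\tau_{11}$: time effect for 1975 for Charlotte
$\tau_{12}$: time effect for 1974 for Birmingham
$\tau_{13}$: time effect for 1975 for Utah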
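
The additivity used above, that the fit statistic of the reduced model X3Z equals the fit statistic of X3 plus the Wald statistic for the corresponding contrast, can be checked numerically. The sketch below does so on small synthetic inputs; the data, the dimensions, and the helper `wls_fit` are all hypothetical and stand in for the actual 72-function analysis.

```python
import numpy as np


def wls_fit(F, V, X):
    """Weighted least squares: estimates b, covariance Vb, and fit statistic Q."""
    Vinv = np.linalg.inv(V)
    Vb = np.linalg.inv(X.T @ Vinv @ X)
    b = Vb @ X.T @ Vinv @ F
    resid = F - X @ b
    Q = float(resid @ Vinv @ resid)
    return b, Vb, Q


rng = np.random.default_rng(0)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # hypothetical model, t = 4
V = np.diag(rng.uniform(0.5, 2.0, size=n))                  # hypothetical covariance
F = rng.normal(size=n)                                      # hypothetical functions

C = np.array([[0.0, 0.0, 1.0, -1.0]])   # H0: beta3 = beta4, so c = 1
Z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])          # t x (t - c), chosen so that C @ Z = 0

b, Vb, Q_w = wls_fit(F, V, X)            # original model
Cb = C @ b
Q_c = float(Cb @ np.linalg.inv(C @ Vb @ C.T) @ Cb)   # Wald statistic for H0
tau, Vtau, Q_wc = wls_fit(F, V, X @ Z)   # reduced model X Z
```

With any Z whose columns span the orthocomplement of C, `Q_wc` agrees with `Q_w + Q_c` up to rounding, which is why the reduction can be judged either from the contrast test or from the change in fit.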

The final step in this model-reduction process, after

measures have been taken to smooth the among subject sources

of variation sex and race, and the within subject source of

variation time, was to examine whether any of the area

reference parameters could be combined. Now, it may have

been decided a priori not to include this phase in the


modeling process, i.e. it may make sense to include a

reference parameter for each individual area and to smooth

only the estimated parameters concerning sex, race, and

time. In that case, model X4 would stand as the 'final'

model. However, given that one has decided to proceed, the

following additional simplifications are of interest:

C1: There is no difference among the reference parameters for Birmingham, Utah, California I, and California II

C2: There is no difference among the reference parameters for Birmingham, Utah, and California I

C1 is contradicted (Qc=9.24, d.f.=3, p=.03), while C2 is not

(QC=2.04, d.f.=2, p=.36). These results imply that model X5,

which is of rank 11, is suitable. The specification matrix

for X5 is shown in Table 5.2.6, along with parameter

estimates and their standard errors. The goodness-of-fit is

QW=45.49 (d.f.=61), with a p-value of .93.

Another tool with which one can assess the

appropriateness of individual model parameters is residual

analysis. While this has been a prime focus of regression

diagnostics in the application of linear models to

continuous data for many years, it has remained relatively

unused in the analysis of categorical data. Vicki Davis

reviews residual analysis for categorical data in the

context of weighted least squares procedures (Davis, 1984).

One can choose to explore residuals at either the function

level or the probability level; the former may allow one to

Table 5.2.6

Specification Matrix, Estimated Parameters, and Standard Errors for Preliminary Model X5 for 2+ Colds

[Specification matrix X5 (72 x 11): entries not legible in the source]

Parameter      Estimate   s.e.    Interpretation
$\beta_1$        .245     .017    Reference value for Charlotte
$\beta_2$        .159     .007    Reference value for Birmingham, Utah, and California I
$\beta_3$        .101     .020    Reference value for NYC
$\beta_4$        .196     .013    Reference value for California II
$\beta_5$        .116     .010    Increment for females in all areas except California I
$\beta_6$        .050     .015    Increment for females in California I
$\beta_7$       -.060     .012    Increment for 'Other' race in Birmingham and Charlotte
$\beta_8$       -.118     .019    Increment for 'Other' race in Utah and California II
$\beta_9$       -.100     .022    Increment for 1975 in Charlotte
$\beta_{10}$     .062     .013    Increment for 1974 in Birmingham
$\beta_{11}$     .089     .017    Increment for 1975 in Utah

assess the compatibility of the data with the chosen model,

while the latter may expose irregularities of the model in

terms of the underlying contingency table. At the function

level, the residuals are defined as $F - \hat{F}$, where $\hat{F} = Xb$. A

consistent estimator for the covariance matrix of $(F - \hat{F})$ can

be written

$$V_{F-\hat{F}} = V_F - V_{\hat{F}} = V_F - X V_b X'$$

The general linear model specification $E\{F(p)\} = X\beta$ can be

written as

$$F(p) = X\beta + \varepsilon$$

where $\varepsilon$ represents a vector of unknown true error terms,

assumed to be multivariate normal with mean vector 0 and a

known covariance matrix V. (Unlike the ordinary least

squares regression, the $\varepsilon$ are not assumed to be

independent.) Given that a particular model is adequate for

F(p), the residuals $(F - \hat{F})$ should also be approximately

multivariate normal. Accordingly, one can compute

standardized residuals as the individual elements of $(F - \hat{F})$

divided by their respective standard errors, and assume

normality with 0 mean and unit variance. By calculating

the p-values for these z-scores, one can spot extreme values

which may require additional attention. Also, if large

numbers of the p-values are significant, either via an

$\alpha$=.10 criterion or a more conservative Bonferroni criterion,

then one may conclude that the model is inappropriate for

many functions and needs to be amended to incorporate

additional sources of variation. Standardized residuals


were calculated for the 2+ colds models X2 (30 parameter),

X3 (22 parameter), and X5 (11 parameter) as an additional

method to assess their effectiveness in characterizing the

data. Specifically, it allows one to see whether the

predicted values of any particular functions are adversely

affected by the model reductions which took place. Table

5.2.7 contains the residuals, their standard errors,

standardized residuals, and the corresponding p-values for

models X2, X3, and X5. X2 is the preliminary model which

included race, sex, and time effects within area. There

were 30 parameters altogether. There are four residuals

which stand out in the sense that their p-values are < .10.

These correspond to the predicted marginal probability of 2+

colds for white males in Utah in 1974 and California II in

1975, and 'Other' males in 1973. 3ust one of these

residuals is significant at the a=.05 level of significance.

This is re-assuring, since one would expect between 3 and 4

of the 72 tests to be significant by chance alone. If one

applies the Bonferroni correction to the tests, the error

rate of a=.05/72=.00070 is not met by any of the p-values.
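
The screening rule described above (standardize each residual, convert to a two-sided normal p-value, and compare against .10 or a Bonferroni level) can be sketched in a few lines. The residual and standard-error pairs below are hypothetical values chosen only for illustration; they are not taken from Table 5.2.7.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


n_tests = 72
alpha = 0.05
expected_by_chance = n_tests * alpha   # about 3.6 tests significant by chance
bonferroni_level = alpha / n_tests     # about .00069

# Hypothetical (residual, standard error) pairs, for illustration only
residuals = [(0.026, 0.027), (-0.024, 0.015), (0.044, 0.021)]

flags = []
for res, se in residuals:
    z = res / se                       # standardized residual
    p = two_sided_p(z)
    # (z-score, flagged at .10, flagged at the Bonferroni level)
    flags.append((round(z, 2), p < 0.10, p < bonferroni_level))
```

The Bonferroni level is far stricter than the .10 screen, so a residual can be flagged for attention without counting as evidence against the model as a whole.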

Table 5.2.7 also contains the residuals and standardized

residuals for the 22 parameter model, which includes

smoothing for some race and sex effects. With this model,

two additional predicted marginal probabilities become

'prominent' in terms of having corresponding p-values less

than .1. These are the estimates for 2+ colds for 'Other'

females in Charlotte in 1973 and in California I in 1974.

Table 5.2.7

Residuals and P-values for Models X2, X3 and X5 for Charlotte and California I

Charlotte                 Model X2       Model X3       Model X5
Sex     Race   Year      Res.    p      Res.    p      Res.    p
Male    White  1973      .005  .852    -.010  .712    -.032  .291
Male    White  1974      .026  .341     .010  .738     .034  .323
Male    White  1975      .026  .213     .014  .575     .016  .545
Male    Other  1973      .002  .856    -.001  .978    -.017  .370
Male    Other  1974      .017  .345     .014  .476     .010  .620
Male    Other  1975      .002  .914    -.001  .931     .006  .757
Female  White  1973      .007  .781     .020  .524     .026  .482
Female  White  1974     -.024  .116    -.013  .558    -.022  .429
Female  White  1975      .020  .476     .033  .324     .042  .287
Female  Other  1973     -.017  .158    -.014  .308    -.003  .894
Female  Other  1974      .023  .087     .026  .098     .044  .038
Female  Other  1975     -.007  .663    -.004  .832     .002  .990

California I              Model X2       Model X3       Model X5
Sex     Race   Year      Res.    p      Res.    p      Res.    p
Male    White  1973      .016  .207     .022  .111     .012  .454
Male    White  1974      .006  .659     .012  .399     .012  .454
Male    White  1975     -.002  .868     .004  .759     .012  .454
Male    Other  1973     -.003  .851    -.012  .573    -.019  .439
Male    Other  1974      .013  .434     .004  .861     .004  .878
Male    Other  1975     -.007  .704    -.015  .480    -.008  .762
Female  White  1973      .070  .156     .091  .092     .072  .199
Female  White  1974      .055  .275     .076  .167     .102  .074
Female  White  1975      .053  .254     .078  .130     .082  .114
Female  Other  1973      .017  .439     .021  .384     .007  .780
Female  Other  1974      .027  .286     .031  .253     .030  .277
Female  Other  1975     -.007  .758    -.003  .905     .007  .780

The p-value for the residuals corresponding to white males

in California I in 1973 is now .15. Only one of the 72

tests is significant at the $\alpha$=.05 level of significance,

which lends credence to the overall acceptability of the

model X3.

Finally, note the residuals and associated statistics

for the final 11 parameter model. Despite a marked decrease

in the number of characterizing parameters, the residuals

for X5 include only five with p-values less than .1 and only

one with a p-value significant at the $\alpha$=.05 level of

significance. The residual for White males in Utah in 1974

is significant at $\alpha$=.05 (p=.038), while the residual for

White males in California I in 1973 again has a p-value

below .10. The other residuals with p-values below .10

include 'Other' males in Charlotte in 1975, 'Other' females

in Charlotte in 1974 and in California I in 1974. Since

there are not any extreme 'misses' so far as predicted

functions go, nor any obvious pattern for the residuals with

the smaller p-values with respect to race, sex, time, or

area, one can conclude that the residual analysis for the 11

parameter model supports the adequacy of the model as

implied by its goodness-of-fit statistic. The original

marginal probabilities for the model X5 as well as the

predicted marginal probabilities are displayed in Table

5.2.8.

Table 5.2.8

Observed and Predicted Marginal Probabilities of 2+ Colds in 1973, 1974 and 1975 with Final Model X5

                        White Males    Other Males   White Females  Other Females
Area           Year      Obs   Pre      Obs   Pre      Obs   Pre      Obs   Pre
Charlotte      1973     .213  .246     .175  .186     .326  .361     .373  .301
               1974     .279  .246     .206  .186     .365  .361     .403  .301
               1975     .162  .146     .048  .086     .282  .262     .284  .202
Birmingham     1973     .142  .159     .080  .099     .255  .274     .222  .215
               1974     .231  .221     .150  .161     .313  .337     .307  .277
               1975     .165  .159     .113  .099     .281  .274     .222  .215
NYC            1973     .127  .101     .105  .101     .194  .217     .188  .217
               1974     .794  .101     .158  .101     .242  .217     .125  .217
               1975     .143  .101     .158  .101     .177  .217     .250  .217
Utah           1973     .156  .159     .034  .040     .297  .274     .222  .156
               1974     .203  .159     .103  .040     .266  .274     .111  .156
               1975     .250  .248     .138  .129     .371  .363     .250  .245
California I   1973     .133  .159     .169  .159     .221  .209     .172  .209
               1974     .164  .159     .155  .159     .221  .209     .141  .209
               1975     .171  .159     .155  .159     .221  .209     .172  .209
California II  1973     .188  .196     .179  .077     .292  .311     .188  .193
               1974     .212  .196     .071  .077     .315  .311     .188  .193
               1975     .219  .196     .179  .077     .304  .311     .156  .193
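
Each predicted probability in Table 5.2.8 is a sum of the X5 parameters that apply to a subpopulation and year. The sketch below rebuilds a few of them from the estimates in Table 5.2.6. Two points are assumptions rather than statements from the source: the negative signs on the eighth and ninth estimates (inferred from the predicted values themselves), and the function and dictionary names, which are hypothetical.

```python
# Estimates as read from Table 5.2.6. The negative signs on B[8] and B[9]
# are an assumption inferred from the predicted values in Table 5.2.8.
B = {1: .245, 2: .159, 3: .101, 4: .196, 5: .116, 6: .050,
     7: -.060, 8: -.118, 9: -.100, 10: .062, 11: .089}

# Which reference parameter applies to each area
REFERENCE = {"Charlotte": 1, "Birmingham": 2, "Utah": 2, "California I": 2,
             "NYC": 3, "California II": 4}


def predict(area, sex, race, year):
    """Predicted probability of 2+ colds under X5 for one subpopulation and year."""
    value = B[REFERENCE[area]]
    if sex == "Female":
        value += B[6] if area == "California I" else B[5]
    if race == "Other":
        if area in ("Birmingham", "Charlotte"):
            value += B[7]
        elif area in ("Utah", "California II"):
            value += B[8]
    if (area, year) == ("Charlotte", 1975):
        value += B[9]
    elif (area, year) == ("Birmingham", 1974):
        value += B[10]
    elif (area, year) == ("Utah", 1975):
        value += B[11]
    return value


birmingham_wm_1974 = predict("Birmingham", "Male", "White", 1974)
charlotte_of_1975 = predict("Charlotte", "Female", "Other", 1975)
```

Small differences in the third decimal place relative to the table are expected, since the tabled estimates are rounded.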


5.3 Linear Models Analysis for 1+ Asthma for the Area x Sex

Cross-classification for 1973, 1974, and 1975 Combined

One approach to the analysis of the outcome variables

in this dataset that does not take into account the repeated

measurement structure of the data is to combine the data for

the years 1973, 1974, and 1975. One could choose to model

the variation among a demographic cross-classification of

those subjects who reported 1+ asthma for at least one of

the years 1973, 1974, or 1975 versus those who reported no

asthma at all for each of the three years. In order to

compare the results of such an analysis with those reported

in section 5.1, a linear models analysis of those subjects

who reported at least one occurrence of asthma symptoms in

the study years 1973-1975 versus those who reported none for

the cross-classification of (area x sex) was completed.

Table 5.3.1 contains the relevant frequencies.

Table 5.3.1

1 or More Asthma Events Reported in 1973, 1974, or 1975

Area            Sex       None    At Least One
Charlotte       Male       185         14
Birmingham      Male       461         55
NYC             Male        76          2
Utah            Male       319         30
California I    Male       453         69
California II   Male       249         39
Charlotte       Female     229         19
Birmingham      Female     543         46
NYC             Female      73          5
Utah            Female     353         33
California I    Female     418         39
California II   Female     275         17

Thus, the objective of this analysis is to model the

variation among the proportions of subjects who reported 1+


asthma in 1973-1975 for the (sex x area) subpopulations

displayed above. Let $y_{hij}$ represent the frequency of

subjects in the h-th area (h=1,2,...,6) for the i-th sex

(i=1 for male, i=2 for female), for the j-th response (j=1 for no asthma,

j=2 for 1+ asthma). Then, the function of interest can be

written

$$F(p) = Ap = \left([\,0 \;\; 1\,] \otimes I_{12}\right) p$$

where $p = (y_{hij} / y_{hi*})$ is the (24 x 1) vector of sample

proportions. Given that the $\{y_{hij}\}$ can be considered to

have the product multinomial distribution, one can proceed

to pursue a weighted least squares analysis of F(p) and its

covariance matrix $V_F = A V_p A'$.
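
As an illustration of how F(p) and its covariance matrix could be formed from the counts in Table 5.3.1, the sketch below builds the proportion vector, a selector matrix A, and the block-diagonal multinomial covariance. The ordering of p (response index varying fastest) is an assumption; under it, and with NumPy's Kronecker convention, the selector is written as the product of the 12 x 12 identity with [0 1].

```python
import numpy as np

# (none, at least one) counts from Table 5.3.1: six areas for males, then females
counts = np.array([[185, 14], [461, 55], [76, 2], [319, 30], [453, 69], [249, 39],
                   [229, 19], [543, 46], [73, 5], [353, 33], [418, 39], [275, 17]],
                  dtype=float)
n = counts.sum(axis=1)
p = (counts / n[:, None]).ravel()   # 24 proportions, response varying fastest

# Selector that keeps the "at least one" proportion of each subpopulation
A = np.kron(np.eye(12), np.array([[0.0, 1.0]]))

# Block-diagonal multinomial covariance: (diag(p_s) - p_s p_s') / n_s per block
Vp = np.zeros((24, 24))
for s in range(12):
    ps = p[2 * s: 2 * s + 2]
    Vp[2 * s: 2 * s + 2, 2 * s: 2 * s + 2] = (np.diag(ps) - np.outer(ps, ps)) / n[s]

F = A @ p            # 12 proportions reporting 1+ asthma
VF = A @ Vp @ A.T    # diagonal, with entries F(1 - F)/n
```

Because the subpopulations are independent samples, `VF` is diagonal and each entry is the familiar binomial variance of a proportion.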

In previous sections, the cell mean (identity) model

was first used in conjunction with a series of hypothesis

tests to produce a preliminary assessment of variation. The

subsequent results usually indicated the direction to take

in forming descriptive models. Since the point of this

section is to compare the results with those of section 5.1,

it was decided to begin with a saturated model which

includes six reference parameters for the six areas and six

additional parameters for the incremental effect of sex

within each of the areas. This model can be written

$$E\{F(p)\} = A\pi = X_1\beta$$

where

$$X_1 = \begin{bmatrix}
1&0&0&0&0&0&0&0&0&0&0&0 \\
0&1&0&0&0&0&0&0&0&0&0&0 \\
0&0&1&0&0&0&0&0&0&0&0&0 \\
0&0&0&1&0&0&0&0&0&0&0&0 \\
0&0&0&0&1&0&0&0&0&0&0&0 \\
0&0&0&0&0&1&0&0&0&0&0&0 \\
1&0&0&0&0&0&1&0&0&0&0&0 \\
0&1&0&0&0&0&0&1&0&0&0&0 \\
0&0&1&0&0&0&0&0&1&0&0&0 \\
0&0&0&1&0&0&0&0&0&1&0&0 \\
0&0&0&0&1&0&0&0&0&0&1&0 \\
0&0&0&0&0&1&0&0&0&0&0&1
\end{bmatrix}$$

$\beta_1, \beta_2, \beta_3, \beta_4, \beta_5$ and $\beta_6$ represent reference parameters for

males in areas Charlotte through California II, and

parameters $\beta_7, \beta_8, \beta_9, \beta_{10}, \beta_{11}$, and $\beta_{12}$ denote incremental

effects for sex within areas Charlotte through California

II. This is a saturated model, and as such there is no

goodness-of-fit statistic defined. Table 5.3.2 contains the

estimated parameters and their standard errors.

Table 5.3.2

Estimates and Standard Errors for Model X1 for 1+ Asthma in the Years 1973-1975 Combined

Parameter      Estimate    Standard Error
$\beta_1$        .070         .0181
$\beta_2$        .107         .0136
$\beta_3$        .026         .0179
$\beta_4$        .086         .0150
$\beta_5$        .132         .0148
$\beta_6$        .135         .0202
$\beta_7$        .006         .0248
$\beta_8$       -.029         .0175
$\beta_9$        .038         .0330
$\beta_{10}$    -.001         .0207
$\beta_{11}$    -.042         .0199
$\beta_{12}$    -.077         .0244

The hypothesis test for no (sex x area) interaction

results in a QC=12.01 with 5 d.f., which is non-supportive


of the hypothesis (p=.03). Further testing was done in order

to investigate potential smoothing across the sex effect

parameters. Specifically, the following simplifications

were evaluated:

$$C_1: \beta_7 = \beta_8 = \beta_9 = \beta_{10} = 0,$$

or the hypothesis of null sex effects for Charlotte,

Birmingham, NYC and Utah. The Wald statistic for this test

is non-significant, QC=4.07 with 4 d.f. (p=.40). The

implied reduced model X2 was fit, and further reduction

hypotheses concerning the reference parameters led to a

final model X3 , which is displayed in Table 5.3.3. This

final model includes reference parameters for NYC,

California I, California II, and a common reference

parameter for Charlotte, Birmingham, and Utah combined. In

addition, sex effects included are those for NYC, California

I and California II. The estimates, standard errors, and

observed and predicted functions for the model

$$E\{F(p)\} = X_3\beta$$

are also displayed in Table 5.3.3.
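
The saturated fit and the test of null sex effects for Charlotte, Birmingham, NYC and Utah can be retraced from the counts in Table 5.3.1. The sketch below is one way to do so with NumPy; the variable names are illustrative, and the code is a reconstruction under the stated model rather than the original program.

```python
import numpy as np

# 1+ asthma counts and subpopulation sizes from Table 5.3.1 (males, then females)
events = np.array([14, 55, 2, 30, 69, 39, 19, 46, 5, 33, 39, 17], dtype=float)
totals = np.array([199, 516, 78, 349, 522, 288, 248, 589, 78, 386, 457, 292],
                  dtype=float)

F = events / totals                  # observed proportions
VF = np.diag(F * (1 - F) / totals)   # binomial variances, independent samples

# Saturated model: six male reference values plus six sex increments
I6 = np.eye(6)
X1 = np.block([[I6, np.zeros((6, 6))], [I6, I6]])

VFinv = np.linalg.inv(VF)
Vb = np.linalg.inv(X1.T @ VFinv @ X1)
b = Vb @ X1.T @ VFinv @ F            # weighted least squares estimates
se = np.sqrt(np.diag(Vb))

# Null sex increments for Charlotte, Birmingham, NYC and Utah (parameters 7-10)
C = np.zeros((4, 12))
C[np.arange(4), np.arange(6, 10)] = 1.0
Cb = C @ b
Q_C = float(Cb @ np.linalg.inv(C @ Vb @ C.T) @ Cb)
```

With these inputs `Q_C` comes out at about 4.07 on 4 d.f., in line with the value reported for this simplification, and `b` and `se` agree with the entries of Table 5.3.2 for the first and last parameters to the precision shown there.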

The Wald statistic for the goodness-of-fit for model X3 is

QC=5.19 with 6 d.f. The p-value for this chi-square is

.5214. Thus, the model renders an adequate description of

the variation among the estimates of 1+ asthma in the (sex x

area) subpopulations. A comparison with the final model for

1+ asthma for 1973, 1974, and 1975 discussed in section 5.1

shows similarities. Both models have similar reference

parameters for California I and California II; however, this

section's analysis did not include smoothing them

Table 5.3.3

Estimates, Standard Errors, Observed and Predicted Functions for the Model X3 for 1+ Asthma in 1973-1975

Specification Matrix      Observed       Predicted
                          Proportions    Proportions
1 0 0 0 0 0 0                .070           .085
1 0 0 0 0 0 0                .107           .085
0 1 0 0 0 0 0                .026           .037
1 0 0 0 0 0 0                .086           .085
0 0 1 0 0 0 0                .132           .133
0 0 0 1 0 0 0                .135           .135
1 0 0 0 0 0 0                .077           .085
1 0 0 0 0 0 0                .078           .085
0 1 0 0 1 0 0                .064           .037
1 0 0 0 0 0 0                .085           .085
0 0 1 0 0 1 0                .085           .090
0 0 0 1 0 0 1                .058           .058

Estimates and Standard Errors

$\beta_1$: reference value for Charlotte, Birmingham and Utah     .085 ± .005
$\beta_2$: reference parameter for NYC                            .037 ± .015
$\beta_3$: reference parameter for California I                   .132 ± .015
$\beta_4$: reference parameter for California II                  .135 ± .020
$\beta_4$: sex increment for California I                        -.042 ± .020
$\beta_5$: sex increment for California II                       -.077 ± .024

into one since the California areas had different sex

effects. Both analyses resulted in a single reference

parameter for NYC. However, in the latter analysis,

reference parameters for the remaining areas could be

smoothed into one, while the earlier modeling effort

maintained separate reference terms for Charlotte,

Birmingham, and Utah. Both models include negative sex

effects for California; however, the model for the years

combined has a term for the individual California areas

while the earlier model has one parameter for both

California areas. Finally, there can be no variation due to

time for the analysis in this section since the responses

for each year are summed together. There is a significant

effect for 1975 for all males in the earlier analysis.

It should be realized that sections 5.1 and 5.3 are

concerned with different response outcomes. The outcome of

5.1 is the proportion of 1+ asthma for 1973, 1974, and 1975

separately. Since time is a dimension across which the

response measures are determined, the regression set-up

employed also allows it to be modeled as if it were an

independent effect. The outcome measure of section 5.3 is

the proportion of subjects who had 1+ asthma over the years

1973-1975 combined. This quantity has an entirely different

meaning. However, the comparison of analysis efforts is

warranted since section 5.3 illustrates a direction modeling

efforts might take if time were not considered and provides

a benchmark against a modeling effort which does include


time. Other response outcomes which may have been

considered in alternative analyses are the proportion of 1+

asthma in strictly 1973, 1974, or 1975. However, lessening

sample sizes may have made it difficult to have the

expected frequencies of 1+ asthma greater than the desired 5

in each of the subpopulations. In summary, there was a

significant, although limited effect for time in the

analysis which included time in its modeling framework.

This seems to indicate its advantage over an analysis fairly

similar in nature which does not allow for a time effect.

5.4 Linear Model Analysis of Mean Colds for

1973,1974, and 1975

Another response measure of interest is the mean

number of colds for the years 1973, 1974, and 1975. The

functions can be calculated via an A matrix applied to the

proportion vector resulting from a contingency table.

However, $r = 4^3 = 64$ response profiles (k=0,1,2,3 in each year) can lead to

cumbersome matrix manipulations, and it is much more

efficient to turn to computations on the individual case

records in order to produce the appropriate function vector

of mean colds for 1973, 1974, and 1975. Specifically, let

$y_{ikl}$ represent the outcome for the k-th categorical variable (k=1,2,...,d)

for the l-th subject in the i-th subpopulation, where

$l = 1, 2, \ldots, n_{i*}$, and $n_{i*}$ is the number of subjects in the i-th

subpopulation. Then let $y_{il} = (y_{i1l}, y_{i2l}, \ldots, y_{idl})'$ denote

the vector of responses for the l-th individual in the i-th

subpopulation. The sample mean for the k-th response

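variable in the i-th subpopulation can thus be written

$$\bar{y}_{ik} = \frac{1}{n_{i*}} \sum_{l=1}^{n_{i*}} y_{ikl}$$

It follows from Central Limit Theory that the vector

$\bar{y}_i = (\bar{y}_{i1}, \bar{y}_{i2}, \bar{y}_{i3})'$ will be approximately multivariate normal for large

enough $n_{i*}$ ($\geq 20$), with a covariance matrix estimated by

$$V_{\bar{y}_i} = \frac{1}{n_{i*}} \hat{\Sigma}_i$$

$\Sigma_i$ is the covariance matrix for the case records $\{y_{il}\}$, and

it is consistently estimated as

$$\hat{\Sigma}_i = \frac{1}{n_{i*}} \sum_{l=1}^{n_{i*}} (y_{il} - \bar{y}_i)(y_{il} - \bar{y}_i)'$$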
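
The computation on individual case records described above amounts to a mean vector and a scaled covariance matrix per subpopulation. A minimal sketch follows; the five case records are hypothetical and serve only to show the arithmetic (a real subpopulation would need roughly 20 or more subjects for the normal approximation), and the function name is illustrative.

```python
import numpy as np


def mean_and_covariance(records):
    """Mean vector of the case records and the estimated covariance of that mean.

    `records` is an (n_i x d) array: one row per subject, one column per year.
    """
    y = np.asarray(records, dtype=float)
    n_i = y.shape[0]
    ybar = y.mean(axis=0)
    centered = y - ybar
    sigma_hat = centered.T @ centered / n_i   # divisor n_i, as in the text
    return ybar, sigma_hat / n_i              # covariance of the mean vector


# Hypothetical records: colds reported in 1973, 1974, 1975 by five children
records = [[0, 1, 0], [2, 1, 1], [1, 0, 0], [3, 2, 1], [0, 0, 1]]
ybar, V_ybar = mean_and_covariance(records)
```

Stacking the 24 mean vectors gives the 72 response functions, and placing the 24 covariance blocks on a block diagonal gives their covariance matrix, since the subpopulations are independent.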

For this analysis, the subpopu1ations investigated

were those chosen by the variable selection procedure for 2+

colds in section 4.5. Thus, there will be 24 subpopulations

(i=1,...,24) and 3 response functions per subpopulation: the

mean number of colds in 1973, 1974, and 1975. The estimated

functions are displayed in Table 5.4.1. Since there were no

a priori restrictions concerning the structure of the model

Table 5.4.1

Mean Colds in 1973, 1974, and 1975 By Area, Sex, and Race

Subpopulation                     1973     1974     1975
Charlotte White Male              .890     .831     .691
Charlotte White Female           1.099    1.154    1.010
Charlotte Other Male              .778     .778     .381
Charlotte Other Female           1.149    1.224    1.015
Birmingham White Male             .627     .828     .716
Birmingham White Female           .915    1.066     .966
Birmingham Other Male             .366     .582     .512
Birmingham Other Female           .863    1.019     .783
NYC White Male                    .524     .524     .524
NYC White Female                  .710     .806     .758
NYC Other Male                    .400     .467     .600
NYC Other Female                  .688     .688    1.000
Utah White Male                   .669     .825     .869
Utah White Female                 .951     .991    1.191
Utah Other Male                   .379     .483     .690
Utah Other Female                 .722     .667     .917
California I White Male           .588     .667     .672
California I White Female         .819     .832     .822
California I Other Male           .648     .606     .620
California I Other Female         .594     .625     .672
California II White Male          .738     .792     .827
California II White Female        .969    1.042    1.012
California II Other Male          .571     .250     .571
California II Other Female        .625     .688     .781

for mean colds, hypotheses concerning the variation among

the mean cold estimates were again investigated with the

cell mean model:

$$E\{F\} = A\pi = I\beta = \beta$$

The results of this preliminary phase are shown in Table

5.4.2. There are no significant four or three-way

interactions. There is a (time x area) interaction, as the

Wald test statistic is QW = 62.01 (d.f.=10). Also, there is

a significant (sex x area) interaction (Qw = 15.90, d.f.=5),

and a marginally significant (race x area) interaction,

using a=.05 as the significance level criterion.

These initial efforts at assessing the important

sources of variation among the estimates led to the

conclusion that an appropriate structure for additional modeling would be one in which there would be modules for

each area, with effects for time, sex, and race within each

area. This model can be expressed as follows:

$$E\{F\} = A\pi = X_2\beta_2,$$

where $X_2 = x_2 \otimes I_6$, and $x_2$ is written:

$$x_2 = \begin{bmatrix}
1&0&0&0&0 \\
1&0&0&1&0 \\
1&0&0&0&1 \\
1&0&1&0&0 \\
1&0&1&1&0 \\
1&0&1&0&1 \\
1&1&0&0&0 \\
1&1&0&1&0 \\
1&1&0&0&1 \\
1&1&1&0&0 \\
1&1&1&1&0 \\
1&1&1&0&1
\end{bmatrix}$$
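
The area-module structure can be made concrete by building the 12 x 5 module and repeating it for the six areas. In the sketch below the column roles of the two middle indicators are left unnamed, and the use of `np.kron(np.eye(6), x2)` is an assumption about convention: it yields one module per area on the block diagonal, which is taken here to be what the Kronecker expression in the text denotes.

```python
import numpy as np

# 12 x 5 module as displayed: reference column, two between-subject
# indicators, and the 1974 and 1975 increments
x2 = np.array([[1, a, b, t74, t75]
               for a in (0, 1)
               for b in (0, 1)
               for (t74, t75) in ((0, 0), (1, 0), (0, 1))], dtype=float)

# One module per area on the block diagonal (72 functions, 30 parameters)
X2 = np.kron(np.eye(6), x2)

n_functions, n_parameters = X2.shape
residual_df = n_functions - n_parameters   # d.f. for the goodness-of-fit test
```

The 42 residual degrees of freedom match those quoted for the goodness-of-fit statistic of the area-module model.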

Table 5.4.2

Hypotheses and Resulting Test Statistics Concerning Mean Colds Reported in 1973, 1974, and 1975

Hypothesis                                                        QC     d.f.   P-Value

 1. No difference between sexes for averages                     69.41     1     .000
    over area x time x race
 2. No difference between races for averages                     24.18     1     .000
    over area x time x sex
 3. No variation among areas for averages                        31.23     5     .000
    over time x sex x race
 4. No variation among times for averages                         5.19     2     .075
    over sex x area x race
 5. Homogeneity across race of differences                         .42     1     .517
    between sexes for averages across time x area
 6. Homogeneity across areas of differences                      15.90     5     .007
    between sexes for averages across race x time
 7. Homogeneity across time of differences                         .28     2     .870
    between sexes for averages across race x area
 8. Homogeneity across areas of differences                      10.78     5     .056
    between races for averages across sex x time
 9. Homogeneity across time for differences                       1.86     2     .394
    between races for averages across sex x area
10. Homogeneity across time for differences                      62.01    10     .000
    among areas for averages across sex x race
11. No average sex x race x area interaction                      7.18     5     .208
12. No average sex x race x time interaction                       .23     2     .891
13. No average sex x area x time interaction                     12.15    10     .275
14. No average race x area x time interaction                     9.22    10     .512
15. No sex x race x area x time interaction                       7.53    10     .674

The goodness-of-fit Wald statistic for this model is QW =

43.15 (d.f.=42), indicating an adequate fit. Table 5.4.3

contains the parameter estimates for this model as well as

their standard errors.

Table 5.4.3

Parameter Estimates and Standard Errors for the Area Modules Model X1 for Mean Colds in 1973, 1974, and 1975

Parameters
Effect    Charlotte   Birmingham    NYC     Utah    Calif. I   Calif. II
Area        .834        .611        .493    .679      .623       .752
Sex         .349        .317        .233    .257      .158       .241
Race       -.104       -.178       -.056   -.294     -.108      -.359
1974        .020        .184        .031    .093      .038       .017
1975       -.190        .069        .061    .229      .045       .056

Standard Errors
Effect    Charlotte   Birmingham    NYC     Utah    Calif. I   Calif. II
Area        .053        .035        .078    .038      .031       .045
Sex         .059        .038        .094    .047      .040       .057
Race        .062        .039        .092    .070      .055       .084
1974        .051        .030        .074    .039      .032       .042
1975        .051        .030        .082    .041      .033       .048

Interactions which surfaced in the previous

investigation were confirmed in this model framework. The

(sex x area) interaction was significant ($\alpha$=.05) with a Qc =

11.33 (d.f.=5). Similarly, the (race x area) interaction

was significant (Qc = 12.20, d.f.=5) as was the (time x

area) interaction (Qc = 66.65, d.f.=10). Modeling efforts

continued in an attempt to describe the nature of these

interactions by working within the area module structure.

The simplification of equality of reference parameters for

California I and California II was first tested, and

resulted in a Wald statistic of QC=5.46 (1 d.f.). This

reduction is thus contradicted, signifying that the two

Californias can not share a reference parameter. Attention


then focused on assessing whether certain increments for

time which appeared to be marginal from their size and

standard error were indeed null. A test for whether the

increment for 1974 for Charlotte, and both the increments

for 1974 and 1975 for NYC and both Californias were

simultaneously null was performed. This 7 d.f. test

resulted in a test statistic of Qc=4.35, which is supportive

of the simplification.

The contrast tests then focused on simplifications for

the effects pertaining to race and sex. The simplification

of equality of sex increments for the Californias was not

contradicted, as the corresponding test resulted in a Qw of

1.41 (d.f.=1). A subsequent reduction that the sex effects

for the remaining areas be combined into one was also not

contradicted, as the 3 d.f. test resulted in a Qw of 2.26.

In addition, subsequent testing indicated the race

effects for NYC and California I to be effectively null, and

equivalence of race effects for Charlotte and Birmingham,

and Utah and California II to be viable. All tests

regarding the parameter estimates produced for model X1 are

summarized in Table 5.4.4.

Table 5.4.4
Hypothesis Tests Pertaining to the Parameters
For the Area Module Model X1

Hypothesis                                      QC     D.F.   P-value

1. 1974 time effect for Charlotte, NYC,        4.35     7      .739
   California I and II, and 1975 effects
   for NYC, California I and II are null
2. Reference parameters for California I       5.46     1      .019
   and California II are equivalent
3. Sex effects for California I and            1.41     1      .234
   California II are equivalent
4. Sex effects for Charlotte, Birmingham,      2.26     3      .519
   NYC, and Utah are equivalent
5. Race effects for NYC and California I       4.25     2      .119
   are null
6. Race effects for Charlotte and              1.04     1      .307
   Birmingham are equivalent
7. Race effects for Utah and                    .36     1      .550
   California II are equivalent

These hypotheses imply that further model reduction is
appropriate in the analysis of mean colds for 1973, 1974,
and 1975. However, it should be noted that to go from the
preceding hypothesis tests to a model incorporating all of
the results is an aggressive approach in comparison to the
analysis performed for 2+ colds. In that situation, several
stages of hypothesis-testing and model-fitting were employed
in order to come to a satisfactory descriptive model.
Fitting the model implied by the aggregate of the hypothesis
tests in Table 5.4.4 will not necessarily result in an
adequate goodness-of-fit. The change in the goodness-of-fit
statistic is a function of the test statistic for the
simultaneous test of the non-significant hypotheses in Table
5.4.4, and cannot be determined on the basis of the
individual test results.

A 15 parameter model was fitted to the data. Table
5.4.5 contains the specification matrix, the estimated
parameters and their corresponding standard errors. The
model is still in area module form, although the modules for
NYC, California I and California II consist of reference
parameters only. The modules for Birmingham and Utah
include reference parameters as well as incremental effects
for years 1974 and 1975. The module for Charlotte consists
of a reference parameter and an incremental effect for 1975.
There are four remaining parameters. These include an
overall parameter for the increment for females for the
California areas, and a similar parameter for all the other
areas. There are also two overall parameters for race. One
is an effect for Charlotte and Birmingham. The other is an
effect for Utah and California II. The Wald statistic for
the goodness-of-fit of the model is QW = 70.25, with 57 d.f.
The associated p-value is .11, indicating that the model is
marginally appropriate.

Table 5.4.5a
Specification Matrix for Model X2 for
Mean Colds in 1973, 1974 and 1975

SPECIFICATION MATRIX X2

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 1 0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 1 0 0 0 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 0 0 1 0 1 0
1 0 0 0 0 0 0 0 0 0 0 1 0 1 0
1 1 0 0 0 0 0 0 0 0 0 1 0 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
0 0 1 0 1 0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
0 0 1 1 0 0 0 0 0 0 0 1 0 0 0
0 0 1 0 1 0 0 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
0 0 1 1 0 0 0 0 0 0 0 1 0 1 0
0 0 1 0 1 0 0 0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 1 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0 1 0 0 0 0 0 1
0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 1 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 1 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 0 1 0 0 1
0 0 0 0 0 0 1 1 0 0 0 1 0 0 1
0 0 0 0 0 0 1 0 1 0 0 1 0 0 1
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
0 0 0 0 0 0 0 0 0 0 1 0 1 0 1

Table 5.4.5b
Parameter Estimates and Standard Errors for Model
X2 for Mean Colds in 1973, 1974 and 1975

Parameter  Interpretation                                      Estimate ± S.E.

β1    Predicted value for Charlotte                         .774 ± .034
β2    Incremental value for 1975 for Charlotte              .121 ± .044
β3    Predicted value for Birmingham                        .612 ± .031
β4    Incremental value for 1974 for Birmingham             .185 ± .030
β5    Incremental value for 1975 for Birmingham             .069 ± .030
β6    Predicted value for NYC                               .475 ± .044
β7    Predicted value for Utah                              .664 ± .034
β8    Incremental value for 1974 for Utah                   .090 ± .039
β9    Incremental value for 1975 for Utah                   .226 ± .041
β10   Predicted value for California I                      .624 ± .024
β11   Predicted value for California II                     .796 ± .032
β12   Incremental value for females (except Cal I & II)     .300 ± .025
β13   Incremental value for females in Cal I & II           .178 ± .032
β14   Incremental value for 'other' in Char & Birm         -.165 ± .033
β15   Incremental value for 'other' in Utah & Cal II       -.313 ± .053

QW = 70.25     d.f. = 57     p-value = .11

So, although the goodness-of-fit for this model may be

considered marginal, at least in comparison with the

criteria for other models discussed in this chapter, the

model does render an adequate description of the 72

estimates of mean colds in terms of 15 parameters. Since

the point of all these analyses is the description of the

CHESS data rather than inference and decision-making, one

has greater freedom in the model-fitting process in terms of

criteria and appropriateness than in a strict inference

setting. What this model means is that mean colds can be

explained with a varying number of parameters, depending on

the area. Mean colds in NYC can be explained with two: a

reference parameter of .475, and an increment for females of

.300. Thus, the reference parameter reflects the mean colds

for all races, as well as for all three years. A look at

the model structure for Utah reveals the most complicated

parameterization. The reference parameter estimate of .664

is the predicted value for white males in 1973. There are

additional Utah-specific increments for the years 1974 and

1975. There are also incremental effects for females and

'Other' races.

Thus, this model accounts for area and time

interaction by including area-specific effects for time for

Charlotte, Birmingham, and Utah. Sex and race are accounted

for with two overall incremental effects each. The

increment for females in non-California areas is fairly substantial

(.300), when one notes that the reference values range from

.475 to .796. The corresponding increment for females in

either California area is roughly half as much. Both race

effects can be considered 'decremental' in that they are

negative. One final note is that it might be possible to

smooth some of the area reference parameters together,

particularly those for Birmingham and California I.

However, the objective of modelling is not necessarily to

seek out the most succinct structure which can be allowed by

the least stringent goodness-of-fit criterion. Instead, a

parallel consideration should be to find a model which makes

sense from a substantive point of view and exhibits

structural clarity. There is often a point where further

parameter reductions induce complication rather than

clarity.

CHAPTER VI

ANALYSIS OF INCOMPLETE DATA

6.1 Introduction

The previous chapter was concerned with the analysis

of the complete data, defined as those observations which

had information for each of the three years studied. After

variable selection procedures were performed to determine

the subpopulations which exhibited the most variation for

the outcome measures 1+ asthma and 2+ colds for 1973, 1974,

and 1975, linear model analyses were performed to determine

the relationship of these response measures to time and the

selected independent variables. Mean colds in 1973, 1974,

and 1975 were also analysed.

By including incomplete observations into subsequent

analyses, one is able to increase sample sizes greatly.

Since there are 9806 observations with data for 1973, 10762

observations with data for 1974, and 11560 observations with

data for 1975, the sample sizes for estimates derived from

these subsets are at least doubled when compared to those of

the complete data. One way to utilize these additional data

is to pursue separate univariate analyses for the three

years. Section 6.2 is concerned with developing linear

models to describe the variation among sex and area

subpopulations for 1+ asthma for the years 1973, 1974, and

1975 considered separately.

The other way to include all possible data in analyses

is to apply missing data adjustments in conducting

multivariate analyses. The remainder of this chapter

addresses the use of missing data strategies discussed in

Section 3.5 in the multivariate analyses of selected

response measures. The technique of supplemental margins is

applied in the analysis of the proportion of subjects

reporting 1 or more colds in Section 6.3. Multivariate

ratio estimation is used for the same (area x sex) framework

in order to compare the two strategies. Section 6.4 is

concerned with the analysis of 1+ asthma for the years 1973,

1974, and 1975 through multivariate ratio estimation as the

method of adjusting for missing data. Mean colds in 1973,

1974 and 1975 are the response functions analysed with

multivariate ratio estimation in Section 6.5. A discussion

of the relative difficulties and merits of the two different

missing data strategies concludes Chapter VI.

6.2 Univariate Analysis of 1+ Asthma in 1973, 1974, and 1975

Table 6.2.1 contains the estimates of 1+ asthma for
the years 1973, 1974, and 1975. These estimates are based
on those subjects who had valid data for those years,
regardless of which data pattern group they match, i.e.
singles, doubles, or triples. Thus, some of the
observations on which the estimates for 1973 are based may
not have values for 1974 or 1975, and may be considered
incomplete data vectors. Weighted least squares analyses
were carried out for each year as though the data for each
year constituted a separate dataset. The analysis for 1973
is based on 9806 observations, the 1974 analysis is based on
10,762 observations and the analysis for 1975 is based on
11,560 observations. Some observations are included in just
one year's analysis, some in two years, and still others in
all three years (those which were included in the 'complete
data' analysed in Chapter V).

Table 6.2.1
Estimates of 1+ Asthma for 1973, 1974 and 1975
and Standard Errors by Sex and Area

                                      Standard
Area            Sex    Estimates      Errors

1 9 7 3  (n=9806)
Charlotte        M       .0556        .0090
Birmingham       M       .0663        .0063
NYC              M       .0539        .0013
Utah             M       .0499        .0077
California I     M       .0886        .0089
California II    M       .0932        .0108
Charlotte        F       .0432        .0078
Birmingham       F       .0496        .0057
NYC              F       .0493        .0012
Utah             F       .0444        .0072
California I     F       .0534        .0074
California II    F       .0492        .0086

1 9 7 4  (n=10762)
Charlotte        M       .0600        .0094
Birmingham       M       .0567        .0052
NYC              M       .0567        .0099
Utah             M       .0506        .0078
California I     M       .0745        .0086
California II    M       .0101        .0116
Charlotte        F       .0480        .0084
Birmingham       F       .0471        .0049
NYC              F       .0441        .0092
Utah             F       .0338        .0065
California I     F       .0554        .0078
California II    F       .0580        .0098

1 9 7 5  (n=11560)
Charlotte        M       .0734        .0089
Birmingham       M       .0753        .0059
NYC              M       .0588        .0104
Utah             M       .0628        .0860
California I     M       .0110        .0996
California II    M       .0110        .0117
Charlotte        F       .0609        .0081
Birmingham       F       .0682        .0057
NYC              F       .0495        .0101
Utah             F       .0526        .0081
California I     F       .0701        .0082
California II    F       .0585        .0096

First, consider the analysis for the 1973 data. Let

$$y_{il} = \begin{cases} 1 & \text{if asthma is reported in 1973} \\ 0 & \text{otherwise} \end{cases}$$

Let $\bar{y} = (\bar{y}_1', \bar{y}_2', \ldots, \bar{y}_s')'$ denote the compound
vector of sample means for the s subpopulations (i=1,2,..,12)
created by cross-classifying area and sex. Thus, $\bar{y}_1$
represents the estimated proportion of Charlotte males
reporting asthma in 1973, and $\bar{y}_2$ represents the proportion
for Charlotte females. Let $\mu = (\mu_1, \mu_2, \ldots, \mu_s)'$ denote the
expected value of $\bar{y}$. A linear model for $\mu$ can then be
expressed as

$$E\{\bar{y}\} = \mu = X\beta$$

where X is the known (u x t) model specification matrix with
full rank t and $\beta$ is the (t x 1) vector of unknown
parameters. The specification of $V_{\bar{y}}$, the covariance matrix
of $\bar{y}$, is found in expressions (3.2.31) and (3.2.29) in
Chapter III. Weighted least squares estimation is
appropriate to obtain an estimate for $\beta$ as

$$b = (X' V_{\bar{y}}^{-1} X)^{-1} X' V_{\bar{y}}^{-1} \bar{y}$$

By asymptotic theory arguments, b has an approximate
multivariate normal distribution with

$$E_A\{b\} = \beta, \qquad \mathrm{Var}_A\{b\} = (X' V_{\bar{y}}^{-1} X)^{-1}$$
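
The weighted least squares estimator and its covariance matrix can be computed directly from the two formulas above. The sketch below does so in plain Python for a small problem. The four subpopulations, their estimates, and the diagonal covariance matrix are made-up illustrative values (not CHESS data), and the helper names `transpose`, `matmul`, `inverse` and `wls` are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def wls(X, V, y):
    """Return (b, Vb) with b = (X'V^-1 X)^-1 X'V^-1 y and Vb = (X'V^-1 X)^-1."""
    xt_vinv = matmul(transpose(X), inverse(V))
    Vb = inverse(matmul(xt_vinv, X))
    b = matmul(matmul(Vb, xt_vinv), [[v] for v in y])
    return [row[0] for row in b], Vb


# Illustrative inputs: 2 areas x 2 sexes, main-effects model
# (columns: reference value, area increment, female increment).
y = [0.06, 0.04, 0.08, 0.05]
V = [[0.0001 if i == j else 0.0 for j in range(4)] for i in range(4)]
X = [[1, 0, 0],
     [1, 0, 1],
     [1, 1, 0],
     [1, 1, 1]]

b, Vb = wls(X, V, y)
for k, est in enumerate(b):
    print(f"b{k + 1} = {est:.4f}  (s.e. {Vb[k][k] ** 0.5:.4f})")
```

Here V is taken to be diagonal because estimates from distinct subpopulations are independent in a single-year analysis; the same function accepts a full covariance matrix when the functions are correlated.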

A cell mean model was first used to gain a preliminary
assessment of the variation among the estimates. This model
is expressed as

$$E\{\bar{y}\} = X\beta = I\beta = \beta$$

Linear hypotheses of the form $C\beta = 0$ were then used to
investigate variation, in particular whether there were sex
differences, area differences, or (sex x area) interactions
present for the 1973 estimates of the proportions reporting
asthma. Table 6.2.2 contains the C matrices used in these
tests, as well as the resulting test statistics

$$Q_C = b'C'(C V_b C')^{-1} C b$$
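
The contrast statistic can be evaluated without forming an explicit inverse by solving a small linear system. The following is a sketch under stated assumptions: the estimates and covariance matrix are invented for illustration (two areas by two sexes, males listed first), and `solve` and `wald_statistic` are hypothetical helper names.

```python
def solve(a, rhs):
    """Solve a z = rhs for a small square system by Gaussian elimination."""
    n = len(a)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("C Vb C' is singular; check that C has full row rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def wald_statistic(C, b, Vb):
    """Qc = (Cb)' (C Vb C')^-1 (Cb); its d.f. equal the number of rows of C."""
    Cb = [sum(c * x for c, x in zip(row, b)) for row in C]
    CV = [[sum(c * Vb[k][j] for k, c in enumerate(row)) for j in range(len(b))]
          for row in C]
    CVCt = [[sum(cv * c for cv, c in zip(cv_row, c_row)) for c_row in C]
            for cv_row in CV]
    z = solve(CVCt, Cb)
    return sum(x * w for x, w in zip(Cb, z))


# Illustrative cell-mean estimates: males in areas 1-2, then females in areas 1-2.
b = [0.06, 0.08, 0.04, 0.05]
Vb = [[0.0001 if i == j else 0.0 for j in range(4)] for i in range(4)]

C_sex = [[1, 1, -1, -1]]          # no sex differences (1 d.f.)
C_interaction = [[1, -1, -1, 1]]  # no (sex x area) interaction (1 d.f.)

print("Qc for sex:", wald_statistic(C_sex, b, Vb))
print("Qc for sex x area:", wald_statistic(C_interaction, b, Vb))
```

The contrast rows follow the same pattern as the matrices in Table 6.2.2, reduced to two areas.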

Similar strategies were applied to the 1974 and 1975

estimates, and those results are also presented in Table

6.2.2. The outcomes of these tests are similar for the

three analyses. There are significant sex and area

differences for all three. However, the 1973 and 1974 tests

for (sex x area) interaction are clearly non-significant,

while the corresponding test for 1975 with Qc = 10.74 (5

d.f.) has a p-value of .0568.

Table 6.2.2
Results of Linear Hypotheses Concerning Estimates
of 1+ Asthma in 1973, 1974 and 1975

Hypothesis                     Year     QC     df    P-value

H1: There are no sex           1973    13.69    1     .0002
    differences                1974    14.83    1     .0001
                               1975    18.03    1     .0000

H2: There are no area          1973    14.46    5     .0129
    differences                1974    20.82    5     .0009
                               1975    23.21    5     .0003

H3: There is no sex x          1973     7.25    5     .2025
    area interaction           1974     4.16    5     .5260
                               1975    10.74    5     .0568

Contrast Matrices For Hypothesis Tests

H1: C1 = [ 1  1  1  1  1  1 -1 -1 -1 -1 -1 -1 ]

H2: C2 = [ 1  0  0  0  0 -1  1  0  0  0  0 -1 ]
         [ 0  1  0  0  0 -1  0  1  0  0  0 -1 ]
         [ 0  0  1  0  0 -1  0  0  1  0  0 -1 ]
         [ 0  0  0  1  0 -1  0  0  0  1  0 -1 ]
         [ 0  0  0  0  1 -1  0  0  0  0  1 -1 ]

H3: C3 = [ 1  0  0  0  0 -1 -1  0  0  0  0  1 ]
         [ 0  1  0  0  0 -1  0 -1  0  0  0  1 ]
         [ 0  0  1  0  0 -1  0  0 -1  0  0  1 ]
         [ 0  0  0  1  0 -1  0  0  0 -1  0  1 ]
         [ 0  0  0  0  1 -1  0  0  0  0 -1  1 ]

Accordingly, reduced models with main effects for area
and sex were fitted for the 1973 and 1974 estimates; their
form was

$$E\{\bar{y}\} = X_2 \beta_2$$

where

$$X_2 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}$$

and $\beta_2$ is a vector of unknown parameters. If the nearly
significant (sex x area) interaction for 1975 is interpreted
as a chance event, then this same model could have been
applied to 1975. However, the nature of this potential
interaction is of some descriptive interest, and so a
different model was investigated. This model has the form

$$X_2 = \left[\begin{array}{rrrrrrrrrrrr}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}\right]$$

which includes separate predicted reference parameters for
each area and incremental effects for sex within each area.
The parameter estimates for the 1973 and 1974 analyses are
displayed in Table 6.2.3. The Wald goodness-of-fit
statistic for the 1973 analysis is QW = 7.25 (d.f.=5) which
has a p-value of .20 and thus is supportive of the model.
The Wald statistic for the same model for 1974 is QW =
4.16 (d.f.=5), which is also non-significant. No further
model reduction was undertaken for these two analyses.
Predicted values are displayed in Table 6.2.5.

Table 6.2.3
Specification Matrix, Estimated Parameters,
Standard Errors and Predicted Values for
Model X2 for the 1973 and 1974 Analyses

Specification Matrix X2

1 0 0 0 0 0 0
1 0 0 0 0 0 1
1 1 0 0 0 0 0
1 1 0 0 0 0 1
1 0 1 0 0 0 0
1 0 1 0 0 0 1
1 0 0 1 0 0 0
1 0 0 1 0 0 1
1 0 0 0 1 0 0
1 0 0 0 1 0 1
1 0 0 0 0 1 0
1 0 0 0 0 1 1

Parameter Interpretation

β1 : Predicted value for Charlotte
β2 : Incremental effect for Birmingham
β3 : Incremental effect for NYC
β4 : Incremental effect for Utah
β5 : Incremental effect for California I
β6 : Incremental effect for California II
β7 : Incremental effect for females

            1973                  1974
       Estimates and         Estimates and
       Standard Errors       Standard Errors

β1 :    .0592 ± .0065         .0620 ± .0067
β2 :    .0082 ± .0073        -.0021 ± .0072
β3 :    .0022 ± .0108        -.0037 ± .0092
β4 :   -.0022 ± .0079        -.0012 ± .0080
β5 :    .0186 ± .0082         .0105 ± .0085
β6 :    .0186 ± .0090         .0229 ± .0098
β7 :   -.0188 ± .0047        -.0156 ± .0044

The model X2 for the 1975 data is saturated, and
accordingly there is no goodness-of-fit statistic defined.
Two linear hypotheses concerning the model parameters were
tested. H1 investigates whether the sex effects for
Charlotte, Birmingham, NYC, and Utah are null. The
hypothesis is stated

$$H_1: \beta_7 = \beta_8 = \beta_9 = \beta_{10} = 0$$

and the resulting test statistic is QC = 3.00 (d.f.=4),
which is non-significant. The other hypothesis investigated
was whether the area parameters for California I and
California II were equivalent, as well as whether their sex
effects were the same. This hypothesis can be stated:

$$H_2: \beta_5 = \beta_6, \quad \beta_{11} = \beta_{12}$$

and was also non-significant.

The implied six parameter model was then fit to the
1975 estimates. This model can be stated:

$$E\{\bar{y}\} = X_3 \beta_3$$

Table 6.2.4 contains the specification matrix for the model
X3 as well as the estimated parameters and standard errors.
Predicted values are contained in Table 6.2.5. The
goodness-of-fit statistic is QW = 3.84 (d.f.=6), with a
p-value of .70.

Table 6.2.4
Specification Matrix, Estimated Parameters, Standard Errors
and Predicted Values for Model X3 for the 1975 Analysis

Specification Matrix X3

1 0 0 0 0 0
1 0 0 0 0 0
0 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 1 1
0 0 0 0 1 0
0 0 0 0 1 1

Parameter  Interpretation                              Estimate ± S.E.

β1 :  Predicted value for Charlotte                  .0666 ± .0060
β2 :  Predicted value for Birmingham                 .0717 ± .0041
β3 :  Predicted value for NYC                        .0540 ± .0072
β4 :  Predicted value for Utah                       .0574 ± .0059
β5 :  Predicted value for California I and II        .0110 ± .0074
β6 :  Incremental effect for females                -.0450 ± .0097

QW = 3.84     p-value = .699     d.f. = 6

Table 6.2.5
Predicted Values and Standard Errors for 1+ Asthma
in 1973, 1974 and 1975 from Univariate Analyses
For each of the Three Years

6.3 Supplemental Margins in the Analysis of the Proportion

of Colds Reported in 1973, 1974, and 1975

The use of supplemental margins in the analysis of the
CHESS data was first illustrated in Chapter III, as an
example in the section pertaining to supplemental margins
methodology. The primary objective of that analysis was to
estimate the probability of reporting colds in 1973, 1974,
and 1975 $(\pi_{73}, \pi_{74}, \pi_{75})$ for the study population restricted
to Birmingham females. The proportion vector included
components representing the different response profiles for
the triple year, double years, and single years sources of
data (seven in all). Application of the appropriate A
matrix formed marginal probabilities of reporting colds in
1973, 1974, and 1975 whenever they existed for each of the
subpopulations representing a different data source. These
functions were then modeled with the structure

$$E\{F(p)\} = X\pi \tag{6.3.1}$$

where $\pi = (\pi_{73}, \pi_{74}, \pi_{75})'$ and

$$X = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 0 & 1 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix} \tag{6.3.2}$$
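
The row pattern of X in (6.3.2) follows from which years each data source observes: every marginal probability that exists for a source contributes one row pointing at the corresponding year's parameter. The sketch below rebuilds the matrix on that basis. The ordering of the seven sources (triple; doubles 1973-74, 1973-75, 1974-75; singles) is an assumption read off the matrix itself, and the names are illustrative.

```python
YEARS = (1973, 1974, 1975)

# Assumed ordering of the seven data sources: the years each one observes.
DATA_SOURCES = [
    (1973, 1974, 1975),  # triple year
    (1973, 1974),        # double years
    (1973, 1975),
    (1974, 1975),
    (1973,),             # single years
    (1974,),
    (1975,),
]


def specification_matrix(sources, years=YEARS):
    """One row per marginal probability; a 1 marks the year that margin estimates."""
    rows = []
    for observed in sources:
        for year in observed:
            rows.append([1 if year == y else 0 for y in years])
    return rows


X = specification_matrix(DATA_SOURCES)
for row in X:
    print(*row)
```

With these seven sources the function yields twelve rows, one for each marginal function formed by the A matrix.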

The same analysis was repeated for each of the (area x
sex) combinations, resulting in estimates of $\pi_{73}, \pi_{74}, \pi_{75}$ for
the twelve demographic subpopulations. An alternative
strategy used might have been to combine all (area x sex)
subpopulations together and perform an overall analysis.
One potential problem with this might be that the assumption
that the parameters $\pi_{73}, \pi_{74}, \pi_{75}$ represent the probability
of reporting a cold in 1973, 1974, and 1975 respectively for
each of the constituent data pattern subpopulations may not
hold for each of the (area x sex) subpopulations. By
processing one (sex x area) subpopulation at a time, one can
evaluate this assumption individually by evaluating
goodness-of-fit statistics for the model X of (6.3.2) and
let the results identify issues to be addressed with
subsequent analysis efforts. Another potential problem is
mechanical. A great many functions would need to be created
for an overall analysis, i.e. the (12 x 26) A matrix applied
to the (26 x 1) proportion vector $p_G$ for Birmingham females
would have to be one of 12 blocks in a (144 x 312)
transformation matrix A applied to a (312 x 1) proportion
vector $p_G$. Such a proportion vector and its (312 x 312)
covariance matrix are beyond the capacity of many current
computer programs.
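
The dimensions quoted above can be checked with a few lines of arithmetic. This sketch assumes that each yearly response is binary (colds reported or not), so that a data source observed for k years carries 2^k response profiles; that assumption, and the names used, are illustrative rather than taken from the original programs.

```python
# Years observed by each of the seven data sources (triple, doubles, singles).
DATA_SOURCES = [
    (1973, 1974, 1975),
    (1973, 1974), (1973, 1975), (1974, 1975),
    (1973,), (1974,), (1975,),
]


def block_dimensions(sources):
    """Return (marginal functions, response-profile proportions) for one subpopulation."""
    n_functions = sum(len(years) for years in sources)      # one margin per observed year
    n_profiles = sum(2 ** len(years) for years in sources)  # binary profile per source
    return n_functions, n_profiles


n_functions, n_profiles = block_dimensions(DATA_SOURCES)
n_subpopulations = 12  # six areas x two sexes

overall_A = (n_subpopulations * n_functions, n_subpopulations * n_profiles)
overall_cov = (n_subpopulations * n_profiles, n_subpopulations * n_profiles)

print("A matrix for one subpopulation:", (n_functions, n_profiles))
print("A matrix for the overall analysis:", overall_A)
print("Covariance matrix for the overall analysis:", overall_cov)
```

Under that assumption the counts reproduce the (12 x 26) block, the (144 x 312) overall transformation matrix, and the (312 x 312) covariance matrix described in the text.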

The estimates resulting from these analyses are
reported in Table 6.3.1, along with their standard errors
and their goodness-of-fit statistics. While there is no
apparent trend over time, females would appear to report
consistently higher proportions of colds than males. The
goodness-of-fit statistics suggest that the assumption of
similar parameters $\pi_{73}, \pi_{74}, \pi_{75}$ across the component groups
of data sources does not hold for all (area x sex)
subpopulations. While seven subpopulations have entirely
adequate goodness-of-fit statistics, five are questionable,
according to an (α=.05) significance level criterion.
Birmingham males have an especially poor fit, with a QW of
24.27 and an accompanying p-value of .004.

Table 6.3.1
Estimates of π73, π74, π75 by Area and Sex
and Associated Goodness-of-Fit Statistics

Area            Sex       π73    π74    π75      n      QW

Charlotte       Male     .552   .514   .460    1468   15.12*
Charlotte       Female   .661   .674   .640    1437   15.18*
Birmingham      Male     .439   .512   .469    3721   24.27**
Birmingham      Female   .599   .635   .574    3374    6.25
NYC             Male     .415   .404   .371    1001    6.03
NYC             Female   .489   .457   .521     932   12.72
Utah            Male     .461   .528   .528    1463    4.90
Utah            Female   .595   .611   .648    1411   21.46*
California I    Male     .427   .471   .471    1650   11.65
California I    Female   .519   .531   .548    1561    8.71
California II   Male     .473   .478   .493    1276   17.47*
California II   Female   .582   .590   .548    1047    9.29

Note: QW has 9 d.f.; '*' indicates significance at α=.10,
while '**' indicates significance at α=.05.

If the goodness-of-fit tests were all non-significant,
one could immediately proceed to a second modeling phase in
which the estimates of the parameter vector
$\pi' = (\pi_1', \pi_2', \ldots, \pi_s')$ themselves become the functions of
interest, where $\pi_i = (\pi_{73}, \pi_{74}, \pi_{75})$ for i=1,2,..,12 for the
twelve (sex x area) subpopulations. The model

$$E\{F(\hat{\pi})\} = F(\pi) = X\tau \tag{6.3.3}$$

can then be fit to describe the variation across area, sex,
and time. By forcing the X structure on all the (sex x
area) subpopulations, one is essentially choosing to
generate weighted regression estimates for $\pi_{73}$, $\pi_{74}$ and $\pi_{75}$,
averaging those from the seven data source subpopulations.
The goodness-of-fit statistics QW for the model X are thus
ignored for this objective, but do provide information on
the quality of the estimates.

The alternative strategy would be to assess the

implications of using the incomplete data as augmenting a

complete data analysis where compatibilities exist. One

would use the estimates and covariances from the incomplete

data subpopulations (i.e. double years and single years)

when they were similar to the corresponding estimates from

the complete data. One is thus extending the sample size

for the analysis beyond that of the complete data vectors to

gain better precision. However, the assumption that the

complete data estimates are the 'true' estimates may be at

best a leap of faith and may lead to biased results.

The forced similarity strategy was applied to the
estimated parameters of Table 6.3.1. The preliminary model

$$E\{F(\hat{\pi})\} = X\tau = \tau$$

was first fit to the estimates, where X is the identity
matrix, and $F(\hat{\pi})$ is the (36 x 1) vector consisting of the
parameter estimates $\hat{\pi}_{73}$, $\hat{\pi}_{74}$, and $\hat{\pi}_{75}$ for males and females
in Charlotte, Birmingham, NYC, Utah, California I, and
California II respectively. Hypotheses of the form

$$H_0: C\tau = 0$$

were evaluated to gain a preliminary assessment of the
sources of variation among the estimates. The resulting
test statistics are displayed in Table 6.3.2. The tests for
sex and area are highly significant, while the test for time
is borderline at the α=.05 level of significance. However,
the interaction of time with area is highly
significant (QC=47.70, d.f.=10). These findings agree with
those of previous analyses. Unlike most of the findings in
the complete data analyses, there is a significant three-way
interaction.

Table 6.3.2
Results of Linear Hypotheses Concerning Estimates of the
Probability of Reporting Colds in 1973, 1974, and 1975

Hypothesis                                        QC     d.f.   p-val

1. No differences between sexes for            275.19     1     .0000
   averages over area x time
2. No variation among areas for averages       160.33     5     .0000
   over sex x time
3. No variation among time for averages          5.65     2     .0544
   over sex x area
4. Homogeneity across time for differences       1.77     2     .4129
   between sexes for averages across areas
   (i.e. no time x sex interaction)
5. Homogeneity across areas for differences     16.53     5     .0055
   between sexes for averages across time
6. Homogeneity across area for differences      47.70    10     .0000
   among times for averages across sex
7. No average area x sex x time                 21.86    10     .0158
   interaction

Due to the existence of the three-way interaction, the
decision was made to continue with separate analyses for
females and males. Additional modeling is for the purpose
of seeking appropriate descriptions of the interactions in
the data. Steps are taken to eliminate extraneous sources
of variation so as to find settings (subpopulations x times)
with similar predicted values. The models by which these
are "clustered" are specified and evaluated. These clusters
are of descriptive interest in clarifying the nature of the
(area x sex x time) interaction. The function vector of
probability estimates for males was first constructed by
piecing together $(\hat{\pi}_{73}, \hat{\pi}_{74}, \hat{\pi}_{75})$ for the males in Charlotte,
Birmingham, NYC, Utah, California I, and California II to
form an (18 x 1) function vector. The appropriate variances
and covariances were also joined together to create the
appropriate (18 x 18) covariance matrix. A model with the
structure

$$E\{F(\hat{\pi}_M)\} = F(\pi_M) = X\tau \tag{6.3.4}$$

was first fitted to the data, where $F(\pi_M)$ denotes the male
predicted probabilities and X is the (18 x 18) specification
matrix with reference parameters for each area for 1973 and
incremental effects for 1974 and 1975 within each area. The
model is saturated, and accordingly there is no
goodness-of-fit defined. The specification matrix, parameter
estimates and their standard errors are displayed in Table
6.3.3. Parameters $\tau_1$-$\tau_6$ represent predicted reference
values for the probability of reporting colds during 1973 in
the areas Charlotte, Birmingham, NYC, Utah, California I and
California II, respectively; $\tau_7$-$\tau_{12}$ denote incremental
effects for the year 1974 for the areas listed above
respectively, while parameters $\tau_{13}$-$\tau_{18}$ represent the
corresponding effects for 1975.

Table 6.3.3
Specification Matrix, Parameter Estimates, and Standard
Errors for Intermediate Model X for Males

Specification Matrix X (18 x 18): for each area, the rows for
1973, 1974 and 1975 carry the block

1 0 0
1 1 0
1 0 1

in the columns of that area's reference, 1974 and 1975
parameters; all other entries are zero.

Parameter Estimates and Standard Errors

Parameter      Males            Females

τ1          .552 (.019)       .661 (.018)
τ2          .439 (.012)       .599 (.013)
τ3          .415 (.028)       .489 (.028)
τ4          .461 (.011)       .595 (.011)
τ5          .427 (.016)       .519 (.016)
τ6          .473 (.018)       .582 (.019)
τ7         -.038 (.026)       .013 (.024)
τ8          .073 (.016)       .036 (.016)
τ9         -.011 (.034)      -.033 (.035)
τ10         .067 (.023)       .015 (.023)
τ11         .044 (.021)       .012 (.021)
τ12         .005 (.024)       .008 (.024)
τ13        -.092 (.025)      -.021 (.023)
τ14         .029 (.016)      -.025 (.016)
τ15        -.044 (.034)       .032 (.036)
τ16         .067 (.024)       .052 (.023)
τ17         .044 (.021)       .029 (.022)
τ18         .020 (.026)      -.034 (.021)

The significance of the effects for area in 1973,
time, and (time x area) were evaluated with Wald statistics
for the hypotheses H1, H2, and H3 respectively:

$$H_1: \tau_1 = \tau_2 = \tau_3 = \tau_4 = \tau_5 = \tau_6$$

$$H_2: \tau_7 = \tau_8 = \tau_9 = \tau_{10} = \tau_{11} = \tau_{12} = 0, \quad \tau_{13} = \tau_{14} = \tau_{15} = \tau_{16} = \tau_{17} = \tau_{18} = 0$$

$$H_3: \tau_7 = \tau_8 = \tau_9 = \tau_{10} = \tau_{11} = \tau_{12}, \quad \tau_{13} = \tau_{14} = \tau_{15} = \tau_{16} = \tau_{17} = \tau_{18}$$

All these tests were significant. QC for the test for area
was 33.59 with 5 d.f. QC for time was also significant at
55.67, with 12 d.f. and likewise there is clearly a
significant (area x time) interaction (QC=37.19, d.f.=10).
All p-values are less than .0001.

The following reduction was investigated with a test
of the form $H_0: C\tau = 0$ which was evaluated with the Wald
statistic $Q_C = \hat{\tau}'C'(C V_{\hat{\tau}} C')^{-1} C\hat{\tau}$.

$$C_1: \tau_2 = \tau_3 = \tau_5, \text{ and } \tau_4 = \tau_6$$

This simplification states that the reference parameters for
Birmingham, NYC, and California I are equivalent, and
simultaneously, that the reference parameters for Utah and
California II are equivalent. The C matrix for this
hypothesis is written

$$C = \left[\begin{array}{rrrrrrrrrrrrrrrrrr}
0 & 1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right]$$

and the resulting QC=1.08 (d.f.=3), which is clearly
supportive. The implied fifteen parameter model,
incorporating C1 by reducing the number of area reference
parameters from 6 to 3, was then fitted to the estimates.
The model structure XR is presented in Table 6.3.4, along
with the resulting parameter vector $\tau_R$ and the corresponding
standard errors. The QW for this model has the same value
as QC for C1.

Additional analysis efforts are directed at

ascertaining whether variation can be described more

succinctly with model simplifications. As noted in Chapter

V, such simplication is justified by the suitability of

model (6.3.4) for these data, as confirmed by its previously

noted goodness of fit. Further tests concerning model

parameters are intended to motivate subsequent parameter

smoothing, and resulting p-values are considered descriptive

gUidelines for such analysis. Consideration of estimates in

Table 6.3.4 indicates that many of the time effects are

negligible. The time effects for 1974 for Charlotte, NYC,

and California II are very small, as is the 1975 effects for

California II. In addition, the 1975 effects for Birmingham

and California I appear to be similar,

Table 6.3.4

Specification Matrix, Estimated Parameters and Standard Errors
for Reduced Model $X_R$ for the Probability of Reporting
Colds in 1973, 1974 and 1975 for the Males Analysis

Specification Matrix $X_R$

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 1

Parameter       Interpretation                                 Estimate ± Std Err

$\tau_{R1}$     Predicted value for Charlotte                     .552 ± .019
$\tau_{R2}$     Predicted value for Birm, NYC and Cal I           .433 ± .009
$\tau_{R3}$     Predicted value for Utah and Cal II               .467 ± .013
$\tau_{R4}$     Incremental value for 1974 for Charlotte         -.038 ± .026
$\tau_{R5}$     Incremental value for 1974 for Birmingham         .079 ± .014
$\tau_{R6}$     Incremental value for 1974 for NYC               -.028 ± .022
$\tau_{R7}$     Incremental value for 1974 for Utah               .062 ± .021
$\tau_{R8}$     Incremental value for 1974 for California I       .039 ± .018
$\tau_{R9}$     Incremental value for 1974 for California II      .010 ± .021
$\tau_{R10}$    Incremental value for 1975 for Charlotte         -.092 ± .025
$\tau_{R11}$    Incremental value for 1975 for Birmingham         .036 ± .014
$\tau_{R12}$    Incremental value for 1975 for NYC               -.061 ± .023
$\tau_{R13}$    Incremental value for 1975 for Utah               .062 ± .021
$\tau_{R14}$    Incremental value for 1975 for California I       .039 ± .017
$\tau_{R15}$    Incremental value for 1975 for California II      .026 ± .022

as do the 1975 effects for Charlotte and NYC. A model

simplification that would substantiate these points

simultaneously can be stated:

$$C_2:\ \tau_{R4} = \tau_{R6} = \tau_{R9} = \tau_{R15} = 0,\quad \tau_{R11} = \tau_{R14},\quad \tau_{R10} = \tau_{R12}$$

The model implied by the constraints of C2 was fitted to the

estimates. This model can be expressed as:

$$E\{F(\pi)\} = X_F \tau_F$$

where $F(\pi)$ is the vector of six sets of the predicted

estimates $\pi_i = (\pi_{73}, \pi_{74}, \pi_{75})$ for each of the six areas

($i = 1, 2, \ldots, 6$), and $\tau_F$ is the corresponding parameter vector.

The $Q_W$ for this 9 parameter model is $Q_W$ = 6.60 (d.f. = 9), and

is indicative of an adequate fit. One can obtain the value

of the test statistic $Q_C$ for the simplification $C_2$ by taking

the difference of the Wald statistics for the models $X_F$

and $X_R$. Hence, $Q_C$ = 6.60 - 1.08 = 5.52, the degrees of freedom = 9 - 3 = 6,

and the p-value of $Q_C$ is p = .480. Table 6.3.5 contains

the specification matrix $X_F$, the resulting parameter

estimates and their standard errors, and also predicted

values, residuals, and p-values for z-scores constructed

from the residuals as discussed in Chapter V.
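The difference test just described can be checked numerically. The following is only a minimal sketch using the two goodness-of-fit statistics quoted above; the helper name `chi2_sf_even_df` is hypothetical, and it relies on the closed-form chi-square tail probability that holds for even degrees of freedom (the case here, with 6 d.f.).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

# Wald goodness-of-fit statistics quoted in the text
q_final, df_final = 6.60, 9      # final model X_F (9 parameters, 18 functions)
q_reduced, df_reduced = 1.08, 3  # reduced model X_R (15 parameters, 18 functions)

q_c = q_final - q_reduced        # statistic for the simplification C2
df_c = df_final - df_reduced
p_c = chi2_sf_even_df(q_c, df_c)
print(f"Q_C = {q_c:.2f}, d.f. = {df_c}, p = {p_c:.3f}")
```

The resulting p-value should be close to the value of about .48 reported above.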

The estimates for the saturated model X for the

analysis of the females are displayed in Table 6.3.3. A

comparison of the area reference parameters shows that the

estimates for the females are consistently higher than for

males. No discernable pattern appears, however, when one

compares the parameters representing time effects. Similar

hypothesis testing and model

Table 6.3.5

Specification Matrix, Estimated Parameters and Standard Errors
for Final Model $X_F$ for the Probability of Reporting Colds
in 1973, 1974 and 1975 for the Males Analysis

Specification Matrix

1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 0 0 0 0 1 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 1 0 0 0 0
0 0 1 0 0 0 0 0 1
0 1 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0
0 1 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0

Parameter Interpretation                        Estimate ± Std Err

Predicted value for Charlotte                      .530 ± .013
Predicted value for Birm, NYC and Cal I            .429 ± .008
Predicted value for Utah and California II         .475 ± .010
Incremental effect for 1974 for Birmingham         .082 ± .013
Incremental effect for 1974 for Utah               .055 ± .019
Incremental effect for 1974 for Cal I              .042 ± .017
Incr. effect for 1975 for Char and NYC            -.064 ± .015
Incr. effect for 1975 for Birm and Cal I           .040 ± .011
Incremental effect for 1975 for Utah               .054 ± .020

Area             Year   Fhat   Std Err   Residual   P-value

Charlotte        1973   .530    .013       .022       .112
Charlotte        1974   .530    .013      -.016       .276
Charlotte        1975   .466    .014      -.006       .494
Birmingham       1973   .429    .008       .010       .276
Birmingham       1974   .512    .011       .001       .472
Birmingham       1975   .469    .009      -.000       .955
NYC              1973   .429    .008      -.014       .588
NYC              1974   .429    .008      -.025       .184
NYC              1975   .365    .016       .006       .671
Utah             1973   .475    .010      -.015       .314
Utah             1974   .530    .017      -.002       .314
Utah             1975   .529    .017      -.001       .314
California I     1973   .429    .008      -.002       .864
California I     1974   .471    .016       .000       .999
California I     1975   .469    .009       .002       .886
California II    1973   .475    .010      -.002       .896
California II    1974   .475    .010       .003       .875
California II    1975   .475    .010       .018       .251

simplification efforts to those employed for the males

analysis were performed for the females analysis. Table

6.3.6 contains the specification matrix and parameter

estimates for a reduced model $X_R$ for the females, and Table

6.3.7 contains the final model XF for the females, as well

as parameter estimates and standard errors. The table also

contains predicted values and residuals. The final model

has eight parameters, including three area reference

parameters, two incremental time effects for 1974 and three

incremental time effects for 1975. The goodness-of-fit

statistic for this model is $Q_W$ = 7.56, with 10 d.f. Its

non-significance is supportive of the appropriateness of the

model (p-value=.67).

Thus, the forced similarity structure for the

predicted probabilities of reporting colds has been

accommodated with separate models for each sex. Smoothing is

accomplished across the area reference parameters for both

modeling efforts. Charlotte maintains its own reference

value in both cases, but various combinations of the other

parameters can be merged together. The three areas

Birmingham, NYC, and California I are put together for the

males, while the three-way smoothing for the females

consists of Birmingham, Utah, and California II. Both sexes

had significant 1974 effects for Birmingham; the males

included two other 1974 effects while the females had just

one other 1974 effect for NYC. Every area except California

II had significant time effects for 1975 for the males.

Table 6.3.6

Specification Matrix, Estimated Parameters and Standard Errors
for Reduced Model $X_R$ for the Probability of Reporting
Colds in 1973, 1974 and 1975 for the Females Analysis

Specification Matrix $X_R$

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 1

Parameter       Interpretation                                 Estimate ± Std Err

$\tau_{R1}$     Predicted value for Charlotte                     .661 ± .018
$\tau_{R2}$     Predicted value for Birm, Utah and Cal II         .594 ± .009
$\tau_{R3}$     Predicted value for NYC and Cal I                 .512 ± .014
$\tau_{R4}$     Incremental value for 1974 for Charlotte          .013 ± .024
$\tau_{R5}$     Incremental value for 1974 for Birmingham         .040 ± .014
$\tau_{R6}$     Incremental value for 1974 for NYC               -.054 ± .025
$\tau_{R7}$     Incremental value for 1974 for Utah               .016 ± .019
$\tau_{R8}$     Incremental value for 1974 for California I       .019 ± .020
$\tau_{R9}$     Incremental value for 1974 for California II     -.001 ± .020
$\tau_{R10}$    Incremental value for 1975 for Charlotte         -.021 ± .023
$\tau_{R11}$    Incremental value for 1975 for Birmingham        -.020 ± .014
$\tau_{R12}$    Incremental value for 1975 for NYC                .010 ± .027
$\tau_{R13}$    Incremental value for 1975 for Utah               .053 ± .019
$\tau_{R14}$    Incremental value for 1975 for California I       .036 ± .020
$\tau_{R15}$    Incremental value for 1975 for California II     -.045 ± .022

Table 6.3.7

Specification Matrix, Estimated Parameters and Standard Errors
for Final Model $X_F$ for the Probability of Reporting Colds
in 1973, 1974 and 1975 for the Females Analysis

Specification Matrix

1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 1 0 1 0 0 0 0
0 1 0 0 0 1 0 0
0 0 1 0 0 0 0 0
0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 1 0 0 0 0 1 0
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 1

Parameter Interpretation                        Estimate ± Std Err

Predicted value for Charlotte                      .657 ± .011
Predicted value for Birm, Utah and Cal II          .597 ± .008
Predicted value for NYC and California I           .527 ± .009
Incremental effect for 1974 for Birmingham         .038 ± .013
Incremental effect for 1974 for NYC               -.068 ± .023
Incremental effect for 1975 for Birmingham        -.023 ± .013
Incr. effect for 1975 for Utah                     .050 ± .018
Incr. effect for 1975 for California II           -.047 ± .021

Area             Year   Fhat   Std Err   Residual   P-value

Charlotte        1973   .657    .011      .0045       .752
Charlotte        1974   .657    .011      .0173       .226
Charlotte        1975   .657    .011     -.0162       .167
Birmingham       1973   .597    .008      .0018       .855
Birmingham       1974   .635    .011      .0002       .855
Birmingham       1975   .574    .011      .0001       .855
NYC              1973   .527    .009     -.0378       .155
NYC              1974   .458    .021     -.0018       .496
NYC              1975   .527    .009     -.0057       .784
Utah             1973   .597    .008     -.0018       .904
Utah             1974   .597    .008      .0137       .378
Utah             1975   .647    .017      .0009       .598
California I     1973   .527    .009     -.0076       .574
California I     1974   .527    .009      .0046       .739
California I     1975   .527    .009      .0211       .101
California II    1973   .597    .008     -.0148       .405
California II    1974   .597    .008     -.0070       .700
California II    1975   .550    .020     -.00167      .632

$Q_W$ = 7.56     p-value = .671     d.f. = 10

Charlotte and NYC shared one parameter, Birmingham and

California I shared another, and Utah had its own 1975

effect. Birmingham, Utah and California II had individual

time effects for the model for the females. When both

models are considered together, it appears that time is a

major factor in the estimates for Birmingham, while it is

least important for California II. The residuals are

reasonable for both models as judged by their p-values, and

the predicted values demonstrate that the estimated

proportions for the females are consistently higher than

they are for the males.
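The comparison of predicted values can be reproduced directly from the reported final-model estimates. The sketch below is illustrative only: it uses the rounded estimates of Tables 6.3.5 and 6.3.7, and the function name `predicted` and the dictionary layout are assumptions made for this example. Each predicted value is the area reference value plus the incremental time effect, if any, which is what multiplying a row of the specification matrix by the parameter vector amounts to.

```python
AREAS = ["Charlotte", "Birmingham", "NYC", "Utah", "California I", "California II"]
YEARS = [1973, 1974, 1975]

def predicted(reference, increments):
    """Return {(area, year): reference value + incremental effect (0 if none)}."""
    return {(a, y): reference[a] + increments.get((a, y), 0.0)
            for a in AREAS for y in YEARS}

# Rounded estimates from Table 6.3.5 (males)
males = predicted(
    {"Charlotte": .530, "Birmingham": .429, "NYC": .429,
     "Utah": .475, "California I": .429, "California II": .475},
    {("Birmingham", 1974): .082, ("Utah", 1974): .055, ("California I", 1974): .042,
     ("Charlotte", 1975): -.064, ("NYC", 1975): -.064,
     ("Birmingham", 1975): .040, ("California I", 1975): .040, ("Utah", 1975): .054},
)

# Rounded estimates from Table 6.3.7 (females)
females = predicted(
    {"Charlotte": .657, "Birmingham": .597, "NYC": .527,
     "Utah": .597, "California I": .527, "California II": .597},
    {("Birmingham", 1974): .038, ("NYC", 1974): -.068,
     ("Birmingham", 1975): -.023, ("Utah", 1975): .050, ("California II", 1975): -.047},
)

for key in males:
    print(key, f"{males[key]:.3f}", f"{females[key]:.3f}")

# True when the female prediction exceeds the male one in every (area, year) cell
females_higher = all(females[k] > males[k] for k in males)
print(females_higher)
```

Because the inputs are rounded to three decimals, the results may differ from the tabled predicted values in the last digit.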

However, yet another model refinement is possible,

this time motivated by the structure of the predicted values

resulting from the 'final' models. It would appear that the

predicted values of Table 6.3.5 for the males could be

classified into one of four (time x location) clusters,

these being centered around the values .52, .47, .42, and

.36. One can assess the appropriateness of such a

classification scheme by fitting a linear model which

characterizes the original functions correspondingly. Koch,

Johnson, and Tolley (1972) discuss such a strategy in the

analysis of survival rates. Such a structure would thus

model similarities across both time and area, providing an

appropriate descriptive tool for this data, especially in

light of the interactions present. This model can be stated

as

$$E\{F(\pi)\} = X_P \beta$$

where $X_P$ represents the specification matrix suggested by

the predicted values from the previous model. This

specification matrix, as well as the resulting estimates of

the parameters, are displayed in Table 6.3.8a. The Wald

goodness-of-fit statistic is $Q_W$ = 8.77 (d.f. = 14), which is

indicative of an adequate description of the data. Table

6.3.8b includes the analogous information when the same

strategy is applied to the females. In accordance with the

general trend that females report more colds than males, the

cluster values for females are higher than those for males

at .65, .60, .53 and .46. The goodness-of-fit statistic

for the females is also non-significant ($Q_W$ = 13.30, d.f. = 14).

The precise parameter interpretations are listed at the

bottom of the table.

6.3.2 Supplemental Margins as a Means of Extending the

Complete Data Sample Size

Table 6.3.9 contains the estimated probabilities of

reporting colds in 1973, 1974, and 1975 by data pattern

sources for the (area x sex) subpopulations which had

inadequate goodness-of-fit statistics for the model 6.3.3 in

Section 6.3.1. An alternative strategy for incorporating

the incomplete data into the supplemental margins analysis

is to consider the analysis as an extension to a complete

data analysis, and to include those estimates from the

double and single patterns when they are similar to the

complete (triple) estimates. The estimates and standard

errors from the incomplete data groups were compared to

Table 6.3.8a

Specification Matrix, Parameter Estimates and Predicted
Values for Cluster Model for the Proportion of
Males Reporting Colds in 1973, 1974 and 1975

Specification Matrix $X_P$

1 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 1 0
0 0 0 1
0 1 0 0
1 0 0 0
1 0 0 0
0 0 1 0
0 1 0 0
0 1 0 0
0 1 0 0
0 1 0 0
0 1 0 0

Estimates and Standard Errors

$\beta_1$    .523 ± .007
$\beta_2$    .471 ± .006
$\beta_3$    .429 ± .008
$\beta_4$    .374 ± .021

$\beta_1$: Predicted value for Charlotte in 1973, 1974,
           Birmingham in 1974 and Utah in 1974 and 1975

$\beta_2$: Predicted value for Charlotte in 1975, Birmingham
           in 1975, Utah in 1973, California I in 1974 and
           1975, and California II all three years

$\beta_3$: Predicted value for Birmingham in 1973,
           NYC in 1973 and 1974, and California I in 1973

$\beta_4$: Predicted value for NYC in 1975

$Q_W$ = 8.77     p-value = .8455     d.f. = 14

Table 6.3.8b

Specification Matrix, Parameter Estimates and Predicted
Values for Cluster Model for the Proportion of
Females Reporting Colds in 1973, 1974 and 1975

Specification Matrix $X_P$

1 0 0 0
1 0 0 0
1 0 0 0
0 1 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
0 0 1 0
0 1 0 0
0 1 0 0
1 0 0 0
0 0 1 0
0 0 1 0
0 0 1 0
0 1 0 0
0 1 0 0
0 0 1 0

Estimates and Standard Errors

$\beta_1$    .646 ± .007
$\beta_2$    .590 ± .006
$\beta_3$    .531 ± .008
$\beta_4$    .459 ± .021

$\beta_1$: Predicted value for Charlotte, Birmingham in
           1974, and Utah in 1975

$\beta_2$: Predicted value for Birmingham in 1973 and 1975,
           Utah in 1973 and 1974 and California II in 1973 and 1974

$\beta_3$: Predicted value for NYC in 1973 and 1975,
           California I in all three years, and California II in 1975

$\beta_4$: Predicted value for NYC in 1974

$Q_W$ = 13.30     p-value = .503     d.f. = 14

Table 6.3.9

Charlotte Males

Data pattern     P73    P74    P75
73 74 75        .613   .507   .442
73 74           .453   .459     -
73 75           .500     -    .370
74 75             -    .529   .471
73              .570     -      -
74                -    .541     -
75                -      -    .479

Statistic       Value   p-value   df
Q_W,ALL         15.12    .0878     9
Q_W,E            8.43    .3925     8
Q_W,CASE         6.34    .5008     7

Charlotte Females

Data pattern     P73    P74    P75
73 74 75        .706   .698   .653
73 74           .629   .685     -
73 75           .674     -    .457
74 75             -    .561   .621
73              .630     -      -
74                -    .667     -
75                -      -    .653

Statistic       Value   p-value   df
Q_W,ALL         15.18    .0861     9
Q_W,E            8.48    .3876     8
Q_W,CASE         8.45    .2949     7

Birmingham Males

Data pattern     P73    P74    P75
73 74 75        .373   .490   .461
73 74           .485   .592     -
73 75           .482     -    .571
74 75             -    .530   .470
73              .478     -      -
74                -    .488     -
75                -      -    .458

Statistic       Value   p-value   df
Q_W,ALL         24.27    .0039     9
Q_W,E            5.04    .4116     5
Q_W,CASE         2.69    .6107     4

Utah Females

Data pattern     P73    P74    P75
73 74 75        .563   .622   .689
73 74           .688   .656     -
73 75           .545     -    .576
74 75             -    .526   .684
73              .630     -      -
74                -    .612     -
75                -      -    .566

Statistic       Value   p-value   df
Q_W,ALL         21.46    .0108     9
Q_W,E            7.89    .3419     7
Q_W,CASE         7.40    .2851     6

California II Males

Data pattern     P73    P74    P75
73 74 75        .489   .483   .545
73 74           .496   .526     -
73 75           .297     -    .459
74 75             -    .542   .500
73              .478     -      -
74                -    .425     -
75                -      -    .437

Statistic       Value   p-value   df
Q_W,ALL         17.41    .0418     9
Q_W,E            5.89    .5526     7
Q_W,CASE         5.12    .5285     6

those for the completes (73,74,75) and, when they did not

appear similar (arbitrarily taken to mean more than two

standard errors away), were dropped from the analysis. Then

$\pi_{73}$, $\pi_{74}$, and $\pi_{75}$ were re-estimated with the model (6.3.3)

modified by deletion of the rows of X corresponding to the

eliminated functions. The goodness-of-fit statistic was

then evaluated to determine whether the deletions were

appropriate. These statistics are written $Q_{W,E}$ (for

element-wise deletion) and also appear in Table 6.3.9.

One may consider it prudent to delete from

consideration those estimates from an entire data pattern

group instead of just individual members, with the rationale

being that whatever process led to a questionable estimate

for one attribute may have been likely to affect others as

well. This may be considered a more conservative approach.

Thus, the estimation process was repeated a third time,

dropping those functions from an entire data pattern group

when any of its members was eliminated. The goodness-of-fit

statistics for these models are written $Q_{W,CASE}$ and also

appear in Table 6.3.9.
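The two screening rules can be sketched in code. This is a minimal illustration and not the procedure actually programmed for the analysis: the function names are hypothetical, the standard errors are made-up placeholder values (Table 6.3.9 does not report them), and the comparison is assumed to use the standard error of the complete-data estimate. The estimates are those for Birmingham males in Table 6.3.9. With placeholder standard errors the flagged set need not match the deletions actually made.

```python
def elementwise_flags(complete, se, incomplete, k=2.0):
    """Flag each incomplete-pattern estimate lying more than k standard errors
    from the complete-data estimate for the same year."""
    flagged = set()
    for pattern, estimates in incomplete.items():
        for year, value in estimates.items():
            if abs(value - complete[year]) > k * se[year]:
                flagged.add((pattern, year))
    return flagged

def casewise_flags(incomplete, flagged):
    """Drop every estimate of any data pattern group that has a flagged member."""
    dropped_patterns = {pattern for pattern, _ in flagged}
    return {(p, y) for p in dropped_patterns for y in incomplete[p]}

# Birmingham males, Table 6.3.9: complete (73,74,75) estimates
complete = {1973: .373, 1974: .490, 1975: .461}
# HYPOTHETICAL standard errors, for illustration only
se = {1973: .03, 1974: .06, 1975: .06}
# Estimates from the doubles and singles data pattern groups
incomplete = {
    (1973, 1974): {1973: .485, 1974: .592},
    (1973, 1975): {1973: .482, 1975: .571},
    (1974, 1975): {1974: .530, 1975: .470},
    (1973,): {1973: .478},
    (1974,): {1974: .488},
    (1975,): {1975: .458},
}

element = elementwise_flags(complete, se, incomplete)
case = casewise_flags(incomplete, element)
print(sorted(element))
print(sorted(case))
```

The case-wise set always contains the element-wise set, which is why the case-wise approach is described as the more conservative of the two.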

Seven of the ten element deletions (circled) are from

the double years, two of the single year deletions are for

1975 and the other is for 1973. Of the seven doubles

eliminations, three are from the 73,75 data pattern group,

which is the doubles group about which one might have the

most doubt. Note that there are a few cases where the

assumption that the 'complete' estimates are 'true' is


questionable, especially for Birmingham males for 1973. In

that situation, the incomplete data group estimates are

clustered at .485, .482 and .478, while the 73,74,75

estimate is .373. Table 6.3.10 contains the results of the

estimation of $\pi_{73}$, $\pi_{74}$, $\pi_{75}$ using a modified version of

(6.3.3) for both the element-wise deletions and case-wise

deletions. The starred rows correspond to the

subpopulations where estimates changed. Probably the most

noticeable difference is the estimate for $\pi_{73}$ for Birmingham

males, which drops from .439 for the forced similarity

structure model to .384 when element-wise deletion was

performed. The next largest change is for $\pi_{75}$ for Utah

females, which increases from .648 in Table 6.3.1 to .688

for the element-wise deletions. The differences from the

estimates for element-wise deletion to those for case-wise

deletions are very minor, as might be expected since the

element-wise deletion process should have smoothed out most

of the inconsistent estimates.

The identity model expressed in (6.3.3) was fitted to

these element-wise estimates to gain a preliminary

assessment of the sources of variation. Table 6.3.11

contains the linear hypotheses concerning the resulting

parameter estimates, which are the predicted probability

functions themselves. The Wald statistics to evaluate these

hypotheses are also included. The results are very nearly

identical to those obtained for the same tests for the

forced similarity structure estimates, with a significant

Table 6.3.10

Estimates of $\pi_{73}$, $\pi_{74}$ and $\pi_{75}$ by Area and Sex Adjusting
for Element-Wise Deletions and Case-Wise Deletions

                           Element-Wise               Case-Wise
Area            Sex   pi73    pi74    pi75       pi73    pi74    pi75

Charlotte        M    .581    .509    .460       .581    .525    .461
Charlotte        F    .661    .675    .650       .660    .675    .650
Birmingham       M    .384    .502    .462       .375    .501    .462
Birmingham       F    .599    .635    .574       .599    .635    .574
NYC              M    .415    .404    .371       .415    .404    .371
NYC              F    .489    .457    .521       .489    .457    .521
Utah             M    .461    .528    .528       .461    .528    .528
Utah             F    .591    .616    .688       .591    .613    .687
California I     M    .427    .471    .471       .427    .471    .471
California I     F    .519    .531    .548       .519    .531    .548
California II    M    .485    .484    .528       .485    .484    .535
California II    F    .582    .590    .548       .582    .590    .548

Table 6.3.11

Results of Linear Hypotheses Concerning Estimates of the
Probability of Reporting Colds in 1973, 1974 and 1975

Hypothesis                                          Q_C       p-value   df

1. No difference between sexes for
   averages over area*time                         270.77     .00000     1

2. No variation among areas for
   averages over sex*time                          160.33     .00000     5

3. No variation among times for
   averages over sex*area                            5.84     .05370     2

4. Homogeneity across time for difference
   between sexes for average across areas            1.73     .42086     2

5. Homogeneity across area of differences
   between sexes for averages across time           27.09     .00005     5

6. Homogeneity across area for difference
   between time for averages across sex           1+1. 0 5    .00000    10

7. No sex*time*area interaction                     21.86     .01580    10

three-way interaction surfacing, as well as a significant

(area x time) interaction and a significant (area x sex)

interaction.

Instead of pursuing linear model descriptions of the

estimates obtained using element-wise or case-wise deletion,

it was decided to investigate hypotheses concerning the

variation of time within area and area within time for each

of the six sets of estimates: the forced estimates for both

males and females, the element-wise deleted estimates for

males and females, and the case-wise deleted estimates for

males and females. If the different estimation procedures

led to vast differences in the variation among the

estimates, such differences would appear in the results of

such analysis.

Accordingly, identity cell mean models of the form

(6.3.3) were fit to each of these six groups of estimates of

$\pi_{73}$, $\pi_{74}$, $\pi_{75}$ for the (area x sex) subpopulations.

Hypothesis tests of the form $C\tau = 0$ were then performed to

investigate the sources of variation of interest. These

tests are evaluated with Wald test statistics. Table 6.3.12

contains the C matrices used in conjunction with the various

tests, and Table 6.3.13 displays the resulting test

statistics for each of the six groups of estimates.
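The form of these Wald statistics can be illustrated with a small computation. The sketch below evaluates Q_C = (Cb)'(CVC')^{-1}(Cb) for the two-row contrast testing no time differences in Charlotte, using the forced similarity estimates and standard errors for Charlotte males. It makes one strong simplifying assumption that the analysis itself does not make: the covariances among the three yearly estimates are set to zero, because the full estimated covariance matrix is not reproduced here. The function name is hypothetical, and the result is therefore not expected to reproduce the tabled statistic exactly.

```python
def wald_two_contrasts(b, v, c):
    """Q_C = (Cb)' (C V C')^{-1} (Cb) for a contrast matrix C with exactly two rows."""
    n = len(b)
    cb = [sum(c[r][j] * b[j] for j in range(n)) for r in range(2)]
    cv = [[sum(c[r][j] * v[j][k] for j in range(n)) for k in range(n)] for r in range(2)]
    m = [[sum(cv[r][k] * c[s][k] for k in range(n)) for s in range(2)] for r in range(2)]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    inv = [[m[1][1] / det, -m[0][1] / det],
           [-m[1][0] / det, m[0][0] / det]]
    return sum(cb[r] * inv[r][s] * cb[s] for r in range(2) for s in range(2))

# Forced similarity estimates and standard errors for Charlotte males (Table 6.3.14)
b = [.552, .514, .460]
se = [.019, .019, .017]
# ASSUMPTION for illustration: zero covariances between years (diagonal V)
v = [[se[i] ** 2 if i == j else 0.0 for j in range(3)] for i in range(3)]
# Contrasts: 1973 versus 1975 and 1974 versus 1975
c = [[1, 0, -1],
     [0, 1, -1]]

q = wald_two_contrasts(b, v, c)
print(f"Q_C = {q:.2f} on 2 d.f.")
```

Under the independence assumption the statistic is of the same order as the value reported for the forced similarity estimates; any remaining gap reflects the omitted covariances and rounding of the inputs.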

On the whole, there does not appear to be much

difference in the results of these tests for the element-wise

deletion estimates or case-wise deletion estimates

compared to each other or compared to the forced similarity

Table 6.3.12

Contrast Matrices for Tests of Time Differences Within
Areas and Area Differences Within Time for Estimates
of 1+ Colds via Supplemental Margins

1. No time differences in Charlotte

[  1  0 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 ]
[  0  1 -1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 ]

2. No time differences in Birmingham

[  0  0  0  1  0 -1  0  0  0  0  0  0  0  0  0  0  0  0 ]
[  0  0  0  0  1 -1  0  0  0  0  0  0  0  0  0  0  0  0 ]

3. No time differences in NYC

[  0  0  0  0  0  0  1  0 -1  0  0  0  0  0  0  0  0  0 ]
[  0  0  0  0  0  0  0  1 -1  0  0  0  0  0  0  0  0  0 ]

4. No time differences in Utah

[  0  0  0  0  0  0  0  0  0  1  0 -1  0  0  0  0  0  0 ]
[  0  0  0  0  0  0  0  0  0  0  1 -1  0  0  0  0  0  0 ]

5. No time differences in California I

[  0  0  0  0  0  0  0  0  0  0  0  0  1  0 -1  0  0  0 ]
[  0  0  0  0  0  0  0  0  0  0  0  0  0  1 -1  0  0  0 ]

6. No time differences in California II

[  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0 -1 ]
[  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1 -1 ]

7. No area differences in 1973

[  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0  0 ]
[  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0 -1  0  0 ]
[  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0 -1  0  0 ]
[  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0 -1  0  0 ]
[  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0 -1  0  0 ]

8. No area differences in 1974

[  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1  0 ]
[  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0 -1  0 ]
[  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0 -1  0 ]
[  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0 -1  0 ]
[  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0 -1  0 ]

9. No area differences in 1975

[  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1 ]
[  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0 -1 ]
[  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0 -1 ]
[  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0 -1 ]
[  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0 -1 ]

Table 6.3.13

Test Statistics and Corresponding P-Values for
the Linear Hypotheses Investigating Time Variation
Within Areas and Area Variation Within Time

                             Forced   Forced    Elem     Elem     Case     Case
Hypothesis              df   Males    Females   Males    Females  Males    Females

No time difference       2   13.79     2.24     19.40     1.22    19.95     1.22
in Charlotte                 .0010    .3260     .0000    .5430    .0000    .5420

No time difference       2   22.32    17.24     31.39    17.24    33.70    17.24
in Birmingham                .0000    .0001     .0000    .0002    .0000    .0001

No time difference       2    2.07     4.96      2.07     4.96     2.07     4.96
in NYC                       .3550    .0840     .3550    .0840    .3550    .0840

No time difference       2   10.99     5.52     10.99    14.60    10.99    14.80
in Utah                      .0040    .0633     .0041    .0007    .0041    .0006

No time difference       2    5.83     1.79      5.83     1.79     5.83     1.79
in California I              .0540    .4090     .0540    .4090    .0540    .4090

No time difference       2     .66     2.91      2.74     2.91     3.31     2.92
in California II             .7190    .2320     .2540    .2320    .1910    .2326

No area difference       5   33.59    47.51     53.44    47.20    55.97    45.48
in 1973                      .0000    .0000     .0000    .0000    .0000    .0000

estimates. As far as area variation goes, there are very

significant area differences for each year under

consideration for each set of estimates. When one examines

the results of the tests concerning time differences within

areas for the males, the differences are in degree of

magnitude and not statistical interpretation. There are

significant differences in time for Charlotte for all three

sets of estimates, with the test statistics for the element-wise

deletions and case-wise deletions being larger than

that for the forced similarity estimates. The same

situation occurs for the time differences in Birmingham.

Meanwhile, the time differences for males in California II

are non-significant, although the test statistic has a

higher value for the element-wise and case-wise deletion

estimates. The other three areas for males have identical

results, since none of the relevant estimates were deleted.

The females do exhibit a difference between the forced

similarity estimates and the other estimates. The test for

time differences in Utah only approaches significance with a

p-value of .0633 for the forced similarity estimates, but

those for the element-wise deleted estimates and the case-wise

deleted estimates are strongly significant with p <

.001. However, since the element-wise and case-wise deleted

estimates are a consequence of the debatable assumption that

the 'Complete' data are 'correct', one cannot judge which of

these results is correct. So, only one disparity can be

noted, and it is marginal. In general, the prevailing theme


is similar findings across methods. Certainly, using the

forced similarity estimates is a much easier process, and

possibly preferable for that reason.

However, on the whole, the use of supplemental margins

is a time-consuming and computing intensive process. Since

an alternative would be the use of multivariate ratio

estimation, the estimates of $\pi_{73}$, $\pi_{74}$, and $\pi_{75}$ were

calculated so that the two adjustment procedures could be

compared for this analysis. Future chapter sections will

describe the application of multivariate ratio estimation.

Table 6.3.14 contains the estimates of $\pi_{73}$, $\pi_{74}$, and

$\pi_{75}$ via supplemental margins and multivariate ratio

estimation, along with the corresponding standard errors, by

the (area x sex) subpopulations under study. The estimates

appear to be very similar. As would be expected, the fact

that supplemental margins uses a weighted estimation

procedure leads it to provide lower standard errors,

although the degree of difference is very small. To

summarize, the differences between the two procedures in

terms of producing these estimates appear to be minimal.

Since multivariate ratio estimation will be seen to be much

easier in terms of application, it might be the tool of

choice in a similar situation. However, the benefits of

supplemental margins may be worth the effort when the data

set is not very large. With the CHESS data, however, it

would appear that the use of multivariate ratio estimation

to adjust for missing data is satisfactory.

Table 6.3.14

Estimates of $\pi_{73}$, $\pi_{74}$ and $\pi_{75}$ by Area and Sex by Supplemental
Margins and Multivariate Ratio Estimation

                          Supplemental Margins         Ratio Estimation
Area            Sex     pi73    pi74    pi75         pi73    pi74    pi75

Charlotte        M      .552    .514    .460         .549    .509    .464
                       (.019)  (.019)  (.017)       (.020)  (.020)  (.017)

Charlotte        F      .661    .674    .640         .661    .672    .640
                       (.018)  (.018)  (.016)       (.018)  (.018)  (.016)

Birmingham       M      .439    .512    .469         .444    .508    .466
                       (.012)  (.011)  (.011)       (.013)  (.013)  (.011)

Birmingham       F      .599    .635    .574         .601    .632    .575
                       (.013)  (.011)  (.011)       (.013)  (.011)  (.011)

NYC              M      .415    .404    .371         .428    .410    .373
                       (.028)  (.021)  (.021)       (.029)  (.021)  (.021)

NYC              F      .489    .457    .521         .500    .467    .520
                       (.028)  (.021)  (.023)       (.029)  (.022)  (.023)

Utah             M      .461    .529    .529         .464    .531    .531
                       (.017)  (.018)  (.018)       (.018)  (.018)  (.018)

Utah             F      .595    .611    .648         .600    .615    .638
                       (.017)  (.017)  (.017)       (.017)  (.018)  (.017)

California I     M      .427    .471    .471         .425    .471    .471
                       (.016)  (.016)  (.015)       (.016)  (.016)  (.015)

California I     F      .519    .531    .548         .521    .535    .546
                       (.016)  (.017)  (.016)       (.016)  (.017)  (.016)

California II    M      .473    .478    .493         .478    .488    .490
                       (.018)  (.019)  (.018)       (.018)  (.019)  (.019)

California II    F      .582    .590    .548         .579    .592    .548
                       (.019)  (.020)  (.020)       (.020)  (.021)  (.020)

6.4 Multivariate Ratio Estimation for the Analysis of Mean

Colds for the Six Point Data

The final analysis for the complete data in the

preceding chapter was for mean colds in the years 1973,

1974, and 1975. Multivariate ratio estimation is a useful

technique for handling missing data in the analysis of

means, as well as other types of estimators. The sources of

data included in this analysis are the complete data

subgroups and the doubles subgroups (since the data must

thus be present for six time points, it is referred to as

the six points data). Since three measures per year will be

estimated, this means that there will be one missing value

per observation for the doubles data, or (2131 + 1208 + 434

= 3773) missing altogether, where the numbers in parentheses

represent the number of observations for

(1973,1974),(1973,1975), and (1974,1975) respectively.

Thus, since there are 4002 complete observations, 3773/(3 *

7775), or 16 percent of the data will be missing. If the

single years were also included, a similar calculation would

show that nearly 50 percent of the observations would be

missing. Since it has been suggested that the amount of

missing data for this kind of adjustment scheme be limited

to around 10 percent (Stanish, Gillings, and Koch 1978), it

was decided not to include the single years in this

analysis. Also, unlike the supplemental margins analysis,

multivariate ratio estimation will not result in an

intermediate goodness-of-fit test with which one can assess


the quality of the estimates. By including the doubles, the

sample size for an analysis of mean colds is extended from

4002 to 7775, almost a 95 percent increase.
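The counts behind these percentages can be verified with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the variable names are illustrative.

```python
complete = 4002  # observations with all three years present
doubles = {"1973,1974": 2131, "1973,1975": 1208, "1974,1975": 434}

n_doubles = sum(doubles.values())     # one missing yearly value per doubles observation
n_total = complete + n_doubles        # sample size for the six points data
missing_fraction = n_doubles / (3 * n_total)
increase = (n_total - complete) / complete

print(n_doubles, n_total)
print(f"missing: {missing_fraction:.1%}, increase in sample size: {increase:.1%}")
```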

The shape of the analysis is partly constrained by the

abilities of current computer software. The program MISCAT

is a Fortran program which will generate the necessary ratio

estimates and perform asymptotic regression. However,

there is a limit to the number of functions which it can

handle. This is currently eighty, and must include

indicator functions for the presence or absence of data

values. Thus, the number of possible elements per function

vector is actually forty. This means that modeling three

response variables across the twenty-four subpopulations

generated by an (area x sex x race) cross-classification

would not be possible, since 72 functions would result.

However, one can choose to separate the analysis into

smaller pieces, and if desired, splice the resulting

parameter estimates and covariances back together for a

second modeling stage, much like the predicted estimates of

$\pi_{73}$, $\pi_{74}$, and $\pi_{75}$ were modeled in a second stage in section

6.3. Since sex has shown itself to be a major source of

variation in previous work, it was decided to split the

analysis into two pieces--one for males, and one for

females.
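The software constraint that motivates the split can be stated as a simple count. The following sketch assumes, as described above, that each response function must be accompanied by one indicator function and that the combined total cannot exceed eighty; the function and constant names are hypothetical and are not part of MISCAT.

```python
MAX_FUNCTIONS = 80  # limit stated in the text, counting indicator functions
RESPONSES = 3       # mean colds in 1973, 1974 and 1975

def fits_limit(n_subpopulations):
    """True when the response functions plus their indicators stay within the limit."""
    n_response_functions = RESPONSES * n_subpopulations
    return 2 * n_response_functions <= MAX_FUNCTIONS

print(fits_limit(24))  # full (area x sex x race) cross-classification: 72 response functions
print(fits_limit(12))  # one sex at a time, (area x race): 36 response functions
```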

Table 6.4.1 contains the estimates of mean colds for

1973 for males and females by area and race, and Tables

6.4.2 and 6.4.3 contain the estimates for 1974 and 1975.

Table 6.4.1

Mean Colds by Area, Sex and Race for 1973

MALES
Area             Race     Mean      N    Missing

Charlotte        White     .80     274      43
Charlotte        Other     .68     130      25
Birmingham       White     .64     456     364
Birmingham       Other     .39     245     211
NYC              White     .56     119      83
NYC              Other     .46      26      41
Utah             White     .67     431      95
Utah             Other     .32      38      11
California I     White     .58     571     136
California I     Other     .61      85      23
California II    White     .73     421      87
California II    Other     .52      40       8
TOTAL                      .63    2836    1127

FEMALES
Charlotte        White    1.04     314      31
Charlotte        Other    1.15     123      35
Birmingham       White     .93     500     358
Birmingham       Other     .86     249     200
NYC              White     .73     105      62
NYC              Other     .61      31      42
Utah             White     .96     432      52
Utah             Other     .73      51       5
California I     White     .80     504     113
California I     Other     .67      85      29
California II    White     .92     363      48
California II    Other     .56      41       3
TOTAL                      .90    2798     978

TOTAL
Charlotte        White     .93     588      74
Charlotte        Other     .91     253      60
Birmingham       White     .79     956     722
Birmingham       Other     .63     494     411
NYC              White     .64     224     145
NYC              Other     .54      57      83
Utah             White     .81     863     147
Utah             Other     .55      89      16
California I     White     .68    1075     249
California I     Other     .61     170      52
California II    White     .82     784     135
California II    Other     .54      81      11
TOTAL                      .76    5634    2105

Table 6.4.2

Mean Colds by Area, Sex and Race for 1974

MALES
Area             Race     Mean      N    Missing

Charlotte        White     .77     285      32
Charlotte        Other     .76     141      14
Birmingham       White     .82     770      50
Birmingham       Other     .63     450       6
NYC              White     .55     192      10
NYC              Other     .48      64       3
Utah             White     .80     488      38
Utah             Other     .53      45       4
California I     White     .67     667      40
California I     Other     .60     101       7
California II    White     .77     472      36
California II    Other     .36      47       1
TOTAL                      .72    3722     241

FEMALES
Charlotte        White    1.14     313      32
Charlotte        Other    1.08     144      14
Birmingham       White    1.02     828      30
Birmingham       Other    1.01     439      10
NYC              White     .70     162       5
NYC              Other     .58      71       2
Utah             White     .97     455      29
Utah             Other     .67      52       4
California I     White     .83     583      34
California I     Other     .59     106       8
California II    White     .96     390      21
California II    Other     .66      41       3
TOTAL                      .94    3584     192

TOTAL
Charlotte        White     .96     598      64
Charlotte        Other     .92     285      28
Birmingham       White     .92    1598      80
Birmingham       Other     .82     889      16
NYC              White     .62     354      15
NYC              Other     .53     135       5
Utah             White     .88     943      67
Utah             Other     .61      97       8
California I     White     .74    1250      74
California I     Other     .60     207      15
California II    White     .86     862      57
California II    Other     .50      88       4
TOTAL                      .83    7306     433

Tabl~, 6.4.3


MAL E SArea Race Mean N Missing

Charlotte White .64 211 106Charlotte Other .45 102 53Birmingham White .75 717 103Birmingham Other .53 430 26NYC White .56 156 46NYC Other .46 59 8Utah White .85 453 73Utah Other .70 44 5California I White .70 627 80California I Other .67 101 7California II White .79 383 125California II Other .46 37 11TOTAL .70 3320 643

F E MAL E SCharlotte White .98 244 101e Charlotte Other .85 116 42Birmingham White .96 765 93Birmingham Other .80 422 27NYC White .76 129 38NYC Other .68 60 13Utah White 1.18 431 53Utah Other .82 45 11California I White .86 540 77California I Other .67 101 13California II White 1.03 329 82California II Other .68 38 6TOTAL .93 3220 556

TOT A LCharlotte White .82 455 207Charlotte Other .67 218 95Birmingham White .86 1482 196Birmingham Other .67 852 53NYC White .65 285 84NYC Other .57 119 21Utah White 1.01 884 126Utah Other .76 89 16California I White .78 1167 157California I Other .67 202 20California II White .90 712 207California II Other .57 75 17TOTAL .81 6540 1199


The only discernable trend would appear to be that females

consistently report more colds than males, a point which

supports the decision to model the sexes separately. Also

included in the tables are sample sizes and the number of

those missing values by area and sex. Sample sizes are

sufficient for the subsequent functional asymptotic

regression analysis, as the smallest number of non-missing

observations on which a mean estimate is based is 26 for

'other' males in NYC for 1973. Most of the other non-missing

sample sizes are in the hundreds, one of the

benefits of such a large dataset. The use of asymptotic

results is thus justified.

Let $y_{il}' = (y_{i1l}, y_{i2l}, y_{i3l})$ represent the vector of
responses for the $l$-th subject in the $i$-th subpopulation,
where $i = 1, 2, \ldots, 12$ ($i=1$ indicates Charlotte whites, $i=2$
indicates Charlotte 'others', and so on up to $i=12$ for
California II 'others'). This discussion is limited to
males. Here $y_{ikl}$ denotes the number of colds for the $k$-th
year for the $l$-th person in the $i$-th subpopulation. Let
$\mu_i' = (\mu_{i1}, \mu_{i2}, \mu_{i3})$ represent the expected value of $y_{il}$. An
appropriate estimator for $\mu_i$ which takes into account the
missing value structure of the data is the ratio estimator
(3.5.1) of Chapter III. If $u_{il}' = (u_{i1l}, u_{i2l}, u_{i3l})$ denotes
a random vector of indicator variables for whether or not
the $k$-th response value is present, the ratio estimator for
$\mu_{ik}$ can be expressed as

$$\tilde{y}_{ik} = \frac{\sum_{l} f_{ikl}}{\sum_{l} u_{ikl}} \qquad (6.4.1)$$

where $f_{ikl} = y_{ikl} u_{ikl}$. If $g_{il}' = (f_{i1l}, f_{i2l}, f_{i3l}, u_{i1l}, u_{i2l}, u_{i3l})$
and $\bar{g}_i$ denotes the sample mean vector of the $g_{il}$'s, the
$i$-th ratio estimator can be written

$$\tilde{y}_i = \exp\{A \log(\bar{g}_i)\} \qquad (6.4.2)$$

with $A = [I_3, -I_3]$. The estimated covariance matrix of $\tilde{y}_i$
can be expressed as

$$V_{\tilde{y}_i} = D_{\tilde{y}_i} A D_{\bar{g}_i}^{-1} V_{\bar{g}_i} D_{\bar{g}_i}^{-1} A' D_{\tilde{y}_i} \qquad (6.4.3)$$

where $D_x$ denotes the diagonal matrix with the elements of $x$ on
its diagonal and the covariance matrix $V_{\bar{g}_i}$ for $\bar{g}_i$ is calculated
according to expression (3.5.3) of Chapter III. The estimated $\tilde{y}_i$
and their corresponding standard errors are displayed in Table
6.4.4. The estimates for females are also included.
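
As a rough illustration of the ratio estimator and its linearized covariance described above, the following Python sketch computes both for a single subpopulation. The data are hypothetical, the function name `ratio_estimates` is introduced here for illustration only, and the form used for the covariance of the mean vector (sum of outer products of deviations divided by n squared) is an assumption, since expression (3.5.3) of Chapter III is not reproduced in this section.

```python
import math


def ratio_estimates(y, u):
    """Ratio estimates and their linearized covariance for one subpopulation.

    y[l][k]: response of subject l in year k (ignored where u[l][k] == 0)
    u[l][k]: 1 if that response is present, 0 if it is missing
    """
    n, d = len(y), len(y[0])
    # g_l = (f_l1..f_ld, u_l1..u_ld) with f_lk = y_lk * u_lk
    g_rows = [[yl[k] * ul[k] for k in range(d)] + list(ul) for yl, ul in zip(y, u)]
    g_bar = [sum(row[j] for row in g_rows) / n for j in range(2 * d)]

    # Assumed form of V_g: (1/n^2) * sum of outer products of deviations.
    dev = [[row[j] - g_bar[j] for j in range(2 * d)] for row in g_rows]
    v_g = [[sum(r[a] * r[b] for r in dev) / n ** 2 for b in range(2 * d)]
           for a in range(2 * d)]

    # (6.4.2): exp(A log g_bar) with A = [I, -I]; needs strictly positive means.
    est = [math.exp(math.log(g_bar[k]) - math.log(g_bar[d + k])) for k in range(d)]

    # H = D_y A D_g^{-1}, the Jacobian of the ratios with respect to g_bar.
    h = [[0.0] * (2 * d) for _ in range(d)]
    for k in range(d):
        h[k][k] = est[k] / g_bar[k]
        h[k][d + k] = -est[k] / g_bar[d + k]

    # (6.4.3): V_y = H V_g H'
    hv = [[sum(h[i][a] * v_g[a][b] for a in range(2 * d)) for b in range(2 * d)]
          for i in range(d)]
    cov = [[sum(hv[i][b] * h[j][b] for b in range(2 * d)) for j in range(d)]
           for i in range(d)]
    return est, cov


# Hypothetical data: four subjects, three years, some responses missing.
y = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 2, 3]]
u = [[1, 1, 1], [1, 0, 1], [1, 1, 0], [0, 1, 1]]
estimates, covariance = ratio_estimates(y, u)
standard_errors = [math.sqrt(covariance[k][k]) for k in range(3)]
```

Each estimate is simply the mean of the responses actually observed in that year; writing it as exp{A log(g)} is what allows the covariance to be obtained by the same linearization used for other compounded functions.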

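If $\tilde{y}$ is written $\tilde{y} = (\tilde{y}_1', \tilde{y}_2', \ldots, \tilde{y}_{12}')'$ and has expected value
$\mu = (\mu_1', \mu_2', \ldots, \mu_{12}')'$, then the estimated covariance of $\tilde{y}$, $V_{\tilde{y}}$, can
be computed from (6.4.3) above, replacing $\tilde{y}_i$ by $\tilde{y}$, $\bar{g}_i$ by
$\bar{g} = (\bar{g}_1', \bar{g}_2', \ldots, \bar{g}_{12}')'$, and $V_{\bar{g}_i}$ by the matrix $V_{\bar{g}}$, where $V_{\bar{g}}$ is a
block diagonal matrix with $V_{\bar{g}_i}$ as the $i$-th block.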

Table 6.4.4

Estimates and Standard Errors for Mean Colds in 1973, 1974 and 1975 by Sex, Area and Race

                                     Males                   Females
Area            Race          1973   1974   1975      1973    1974    1975
Charlotte       White  y1     .803   .768   .635     1.045   1.144    .980
Charlotte       Other  y2     .685   .759   .451     1.155   1.083    .853
Birmingham      White  y3     .643   .816   .755      .930   1.016    .957
Birmingham      Other  y4     .388   .627   .535      .863   1.011    .801
NYC             White  y5     .563   .552   .564      .733    .704    .760
NYC             Other  y6     .462   .484   .458      .613    .578    .683
Utah            White  y7     .668   .801   .848      .958    .970   1.183
Utah            Other  y8     .316   .533   .705      .725    .673    .822
California I    White  y9     .581   .667   .700      .802    .834    .863
California I    Other  y10    .671   .604   .673      .671    .594    .673
California II   White  y11    .734   .773   .794      .920    .964   1.033
California II   Other  y12    .525   .362   .459      .561    .659    .684

Variation among the elements of $\tilde{y}$ can be analyzed with
functional asymptotic regression methodology. Let a model
for $\tilde{y}$ be stated as

$$E_A\{\tilde{y}\} = \mu = X\beta \qquad (6.4.4)$$

where $X$ denotes a ($u \times t$) specification matrix of interest
and $\beta$ represents a ($t \times 1$) vector of unknown parameters.
The sample sizes involved with these subpopulations are
sufficiently large such that $\tilde{y}$ is approximately multivariate
normal. The preliminary model

$$E_A\{\tilde{y}\} = I\beta = \beta$$

was again used as a starting point to assess sources of

variation among the estimates. Table 6.4.5 contains the

results of linear hypothesis tests targeted at investigating

potential sources of variation. The Wald statistic QC is

presented for each hypothesis, along with the d.f. and p-

value.

There is a very significant race effect, as well as a

significant area effect (α = .05 level of significance). Time

is borderline. The three-way interaction is not

significant, and neither is the two-way interaction of race

and time. There is however, a very significant interaction

of time and area (QC = 50.86, d.f.=10), as well as a

significant (race x area) interaction (Qc = 13.81, d.f.=5).

Accordingly, the model structure chosen as the framework for

continued linear modeling was one in which effects for race

and time were fit within areas. The area reference

parameters are the predicted mean functions for 1973. Table

6.4.6a contains the specification matrix and Table 6.4.6b

contains the resulting parameter estimates and standard

Table 6.4.5

Hypotheses and Resulting Test Statistics Concerning Mean Colds in 1973, 1974 and 1975 for Analyses of Both Sexes

Hypotheses

1. No difference between races for averages over area x time
2. No variation among areas for averages over time x race
3. No variation among times for averages over race x area
4. Homogeneity across time for difference among areas for average across race (i.e. no area x time interaction)
5. Homogeneity across area for differences between races for averages across time
6. Homogeneity across time for difference between races for averages across area
7. No race x time x area interaction

                   Males                Females
Hypothesis      Qc    P-value        Qc    P-value     DF
1.           31.35     .000       28.05     .000        1
2.           14.52     .013       67.75     .000        5
3.            5.80     .055        0.84     .656        2
4.           50.86     .000       35.43     .000       10
5.           13.81     .017       12.83     .025        5
6.            0.05     .972        1.55     .461        2
7.           10.52     .396        6.35     .785       10

Table 6.4.6a

Specification Matrix for Model X2 for Mean Colds in 1973, 1974 and 1975 for Males and Females

1 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
1 0 1 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
1 0 0 1  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
1 1 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
1 1 1 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
1 1 0 1  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  1 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  1 0 1 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  1 0 0 1  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  1 1 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  1 1 1 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  1 1 0 1  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  1 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  1 0 1 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  1 0 0 1  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  1 1 0 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  1 1 1 0  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  1 1 0 1  0 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  1 0 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  1 0 1 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  1 0 0 1  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  1 1 0 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  1 1 1 0  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  1 1 0 1  0 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 0 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 0 1 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 0 0 1  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 1 0 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 1 1 0  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 1 0 1  0 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 0 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 0 1 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 0 0 1
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 1 0 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 1 1 0
0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  1 1 0 1

Parameter Interpretation

β1, β5, β9, β13, β17, β21: Predicted value for 1973 for Charlotte, Birmingham, NYC, Utah, California I & II
β2, β6, β10, β14, β18, β22: Inc. effect for 'other' race for Charlotte, Birmingham, NYC, Utah, California I & II
β3, β7, β11, β15, β19, β23: Inc. effect for 1974 for Charlotte, Birmingham, NYC, Utah, California I & II
β4, β8, β12, β16, β20, β24: Inc. effect for 1975 for Charlotte, Birmingham, NYC, Utah, California I & II
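
The specification matrix for model X2 repeats one small within-area block six times along the diagonal. A minimal sketch of that construction follows; the names `BLOCK`, `AREAS` and `block_diagonal` are introduced here for illustration and do not come from the source.

```python
# Within-area design block: rows are (white, other) x (1973, 1974, 1975);
# columns are reference value, 'other' race effect, 1974 effect, 1975 effect.
BLOCK = [
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]
AREAS = ["Charlotte", "Birmingham", "NYC", "Utah", "California I", "California II"]


def block_diagonal(block, copies):
    """Place `copies` copies of `block` along the diagonal of a zero matrix."""
    rows, cols = len(block), len(block[0])
    out = []
    for a in range(copies):
        for r in range(rows):
            left = [0] * (cols * a)
            right = [0] * (cols * (copies - a - 1))
            out.append(left + list(block[r]) + right)
    return out


# 36 x 24 matrix: 12 subpopulations x 3 years by 6 areas x 4 parameters.
X2 = block_diagonal(BLOCK, len(AREAS))
```

Building the matrix this way makes the "effects fit within areas" structure explicit: no parameter is shared between areas until the later smoothing steps combine columns.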

Table 6.4.6b

Estimates and Standard Errors for Model X2 for Mean Colds in 1973, 1974 and 1975 for Males and Females

             Males             Females

β1       .803 ± .046       1.082 ± .049
β2      -.115 ± .058       -.028 ± .068
β3       .006 ± .054        .048 ± .055
β4      -.193 ± .059       -.141 ± .060
β5       .626 ± .032        .932 ± .037
β6      -.218 ± .036       -.079 ± .042
β7       .201 ± .033       1.077 ± .037
β8       .127 ± .034       -.005 ± .039
β9       .560 ± .066        .733 ± .072
β10     -.090 ± .071       -.113 ± .098
β11     -.005 ± .071       -.035 ± .079
β12     -.001 ± .077        .038 ± .083
β13      .655 ± .038        .968 ± .044
β14     -.300 ± .075       -.290 ± .090
β15      .146 ± .045        .001 ± .050
β16      .207 ± .049        .204 ± .053
β17      .595 ± .032        .810 ± .038
β18     -.011 ± .068       -.196 ± .061
β19      .061 ± .038        .016 ± .041
β20      .101 ± .040        .054 ± .046
β21      .754 ± .041        .917 ± .047
β22     -.364 ± .095       -.342 ± .094
β23      .007 ± .045        .052 ± .052
β24      .034 ± .051        .115 ± .064

errors. The goodness-of-fit statistic is QW = 10.77

(d.f.=12), and is indicative of an adequate linear

description of the mean cold estimates.
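
The weighted least squares fit and the Wald goodness-of-fit statistic used throughout this section can be sketched as follows. This is an illustrative implementation under stated assumptions, not the software used for the analyses: the helper names are invented here, the small Gauss-Jordan inverse is only meant for toy-sized matrices, and the four estimates and their diagonal covariance matrix are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(a)
    m = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(n):
            if r != c:
                factor = m[r][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def wls_fit(y, v, x):
    """Return (b, V_b, Q_W, d.f.) for the model E{y} = X b with Cov(y) = V."""
    v_inv = inverse(v)
    xt_vinv = matmul(transpose(x), v_inv)
    v_b = inverse(matmul(xt_vinv, x))            # (X' V^-1 X)^-1
    b = matmul(v_b, matmul(xt_vinv, [[yi] for yi in y]))
    fitted = matmul(x, b)
    resid = [[yi - fi[0]] for yi, fi in zip(y, fitted)]
    q_w = matmul(matmul(transpose(resid), v_inv), resid)[0][0]
    return [row[0] for row in b], v_b, q_w, len(y) - len(x[0])


# Hypothetical example: two groups observed at two times, common time effect.
y_hat = [0.5, 0.7, 0.6, 0.8]
v_hat = [[0.01 if i == j else 0.0 for j in range(4)] for i in range(4)]
x_mat = [[1, 0], [1, 1], [1, 0], [1, 1]]
b, v_b, q_w, df = wls_fit(y_hat, v_hat, x_mat)
```

The goodness-of-fit statistic is compared with a chi-square distribution on (number of estimates minus number of parameters) degrees of freedom, which is where the 12 d.f. for the 24-parameter model fitted to 36 estimates comes from.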

The modeling process continues with effort towards

model simplification so that succinct linear

parameterizations of the estimates can be determined for

descriptive purposes. Further stages consist of testing

linear hypotheses concerning parameter estimates and fitting

the reduced model implied by the results of such hypothesis

testing. Attention was first directed at the parameters

reflecting time effects. An intermediate model X3 was

fitted which included a reduction in the number of

parameters for time from twelve to five. Eliminated were

the 1974 effects for Charlotte, NYC and California II,

as well as the 1975 effects for NYC and California II.

The 1974 parameters for Birmingham and Utah

were combined, as were the 1975 parameters for Birmingham

and California I.

Subsequent effort was aimed at smoothing race

parameters and then area parameters. Race effects for NYC

and California I proved to be non-significant, as was

predicted from their estimated values for model X2. The

parameters for Birmingham, Utah and California II were

smoothed into one, leaving the incremental effect for

Charlotte as the only 'stand-alone' race parameter. The

final reduced model reflected area parameter smoothing as

the last step in the model-fitting process. Charlotte and


California II were combined, and NYC and California I were

also combined. This ten parameter model is displayed in

Table 6.4.7a, along with the parameter estimates and their

standard errors. The goodness-of-fit for this model is Qw =

20.05 (d.f.=26). The p-value for Qw is .79.
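
The p-value quoted for a goodness-of-fit statistic is the upper tail of a chi-square distribution. For an even number of degrees of freedom this tail has a closed form (a finite Poisson-type sum), which the sketch below uses so that no statistical library is needed. The function name is an invention for this illustration, and the shortcut does not apply to odd degrees of freedom.

```python
import math


def chi2_upper_tail_even_df(q, df):
    """P(chi-square with `df` d.f. exceeds q); closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even d.f.")
    half = q / 2.0
    term, total = 1.0, 1.0          # j = 0 term of the sum
    for j in range(1, df // 2):     # terms j = 1 .. df/2 - 1
        term *= half / j
        total += term
    return math.exp(-half) * total


# Goodness of fit of the final model for males: Qw = 20.05 on 26 d.f.
p_males_final = chi2_upper_tail_even_df(20.05, 26)
```

The value obtained should be close to the .79 reported for the ten-parameter model.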

The analysis of the females proceeded in a similar

fashion to that for the males. Table 6.4.4 also contains

the estimated Yi for the females in the respective (race x

area) subpopulations. A comparison with the males shows

that the females consistently reported more colds. No other

trend is readily apparent from comparing the table of

estimates. A preliminary investigation of the sources of

variation among these estimates was pursued by fitting the

identity model to the female estimates and utilizing

hypothesis tests concerning the resulting parameter

estimates. These tests revealed no substantial differences

from the results obtained in the analysis for the males.

Table 6.4.5 contains the results of these hypothesis tests.

The significant two-way interactions are (race x area) and

(time x area). The three-way interaction is non-significant.

The only different result from those for the

males is that the average effect for time is quite

non-significant, where it was borderline non-significant for

males (α = .05 level of significance). This test has limited

meaning, however, in the face of a very substantial

interaction of time with area.

Consequently, the same 24 parameter reduced model X2

Table 6.4.7a

Final Model for Mean Colds (for males) in 1973, 1974 and 1975 by Area and Race

Specification Matrix XM

1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0
1 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 1 0
0 1 0 0 1 0 0 0 0 0
0 1 0 0 1 1 0 0 0 0
0 1 0 0 1 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0 1
0 1 0 0 1 0 0 0 0 0
0 1 0 0 1 1 0 0 0 0
0 1 0 0 1 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0

Estimates and Standard Errors

β1  :  .773 ± .023
β2  :  .641 ± .023
β3  :  .564 ± .021
β4  : -.089 ± .052
β5  : -.247 ± .030
β6  :  .184 ± .025
β7  :  .084 ± .032
β8  : -.178 ± .045
β9  :  .122 ± .023
β10 :  .222 ± .043

β1  : Predicted value for 1973 for Charlotte and Cal II
β2  : Predicted value for 1973 for Birmingham and Utah
β3  : Predicted value for 1973 for NYC and California I
β4  : Incremental effect for race for Charlotte
β5  : Incremental effect for race for Birm, Utah and Cal II
β6  : Incremental effect for 1974 for Birmingham and Utah
β7  : Incremental effect for 1974 for California I
β8  : Incremental effect for 1975 for Charlotte
β9  : Incremental effect for 1975 for Birmingham and Cal I
β10 : Incremental effect for 1975 for Utah

Qw = 20.05    p-value = .789    d.f. = 26

Table 6.4.7b

Final Model for Mean Colds (for females) in 1973, 1974 and 1975 by Area and Race

Specification Matrix XF

1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0
0 1 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 0
0 1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 1 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0
0 0 1 0 1 0 0 0 0
0 0 1 0 1 0 0 0 0
0 0 1 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 1
0 1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0 1

Estimates and Standard Errors

β1 : 1.096 ± .036
β2 :  .925 ± .018
β3 :  .795 ± .024
β4 : -.287 ± .063
β5 : -.153 ± .059
β6 :  .099 ± .027
β7 : -.168 ± .051
β8 :  .226 ± .043
β9 :  .083 ± .051

β1 : Predicted value for 1973 for Charlotte
β2 : Predicted value for 1973 for Birm, Utah and Cal II
β3 : Predicted value for 1973 for NYC and California I
β4 : Incremental effect for race for Utah and California II
β5 : Incremental effect for race for California I
β6 : Incremental effect for 1974 for Birmingham
β7 : Incremental effect for 1975 for Charlotte
β8 : Incremental effect for 1975 for Utah
β9 : Incremental effect for 1975 for California II

Qw = 30.54    p-value = .29    d.f. = 27

was fitted for females as was for males. There are six

predicted reference values for areas, six race effects

within areas, and also an effect for 1974 and 1975 within

each area. The resulting parameter estimates are displayed

in Table 6.4.6b, beside the analogous parameter estimates for

the males. A cursory examination shows that there are

similarities in the estimates, at least in direction if not

always in magnitude.

The process of hypothesis testing and fitting reduced

models led to the final model displayed in Table 6.4.7b. It

includes three predicted reference values for 1973: one for

Charlotte alone, one for Birmingham, Utah and California II

combined, and one for NYC and California I combined. The only time

effect for 1974 remaining is one for Birmingham, and 1975

time effects in the model are those for Charlotte, Utah, and

California II.

6.5 Analysis of the Proportion of those Reporting Asthma in

1973, 1974, and 1975 with Multivariate Ratio Estimation

Table 6.5.1 contains estimates of the proportions of

CHESS subjects who reported incidents of asthma in the years

1973, 1974, and 1975. These estimates are from the entire

dataset, not just those having six points as were the focus

of the analysis in the preceding section. Consequently,

much of the data for this analysis will be incomplete. The

table includes tabulations of the number of observed and

missing values by each (area x sex) subpopulation. More

than half of the data values are missing for the estimation

Table 6.5.1

Proportion of CHESS Subjects Reporting Incidents of Asthma in 1973, 1974, and 1975

Area            Sex       Mean       N   Missing

1973
Charlotte       Male       .06     648     820
Charlotte       Female     .04     672     765
Birmingham      Male       .07    1553    2168
Birmingham      Female     .05    1453    1921
NYC             Male       .05     297     704
NYC             Female     .05     304     628
Utah            Male       .05     801     662
Utah            Female     .04     811     600
California I    Male       .09     990     660
California I    Female     .05     917     644
California II   Male       .09     730     546
California II   Female     .05     630     417
TOTAL                      .06    9806   10535

1974
Charlotte       Male       .06     633     835
Charlotte       Female     .05     646     791
Birmingham      Male       .06    1958    1763
Birmingham      Female     .05    1869    1505
NYC             Male       .06     547     454
NYC             Female     .04     499     433
Utah            Male       .05     791     672
Utah            Female     .03     769     642
California I    Male       .07     940     710
California I    Female     .06     867     694
California II   Male       .10     674     602
California II   Female     .06     569     478
TOTAL                      .06   10762    9579

1975
Charlotte       Male       .07     858     610
Charlotte       Female     .06     870     567
Birmingham      Male       .08    2004    1717
Birmingham      Female     .07    1950    1424
NYC             Male       .06     510     491
NYC             Female     .05     465     467
Utah            Male       .06     796     667
Utah            Female     .05     760     651
California I    Male       .11    1061     589
California I    Female     .07     970     591
California II   Male       .11     718     558
California II   Female     .06     598     449
TOTAL                      .07   11560    8781

of 1+ asthma in 1973. Individually, missing data

percentages range from forty percent for California II

females to seventy percent for NYC males. This may be

considered entirely too much missing data from any type of

conservative standpoint. However, the sample sizes on which

the estimates are based are quite large, ranging from 297 to

1958, and this much existing data may serve to offset the

disadvantages of so much incomplete data. The only striking

trend among the estimates would appear to be that males

consistently report more asthma than females. This is in

contrast to previous findings for colds, where females

consistently reported more incidents than males. These

estimates and their standard errors are also reported in

Table 6.5.1.

Let $\tilde{y}_i' = (\tilde{y}_{i1}, \tilde{y}_{i2}, \tilde{y}_{i3})$ represent the multivariate
ratio estimator of the proportion reporting 1+ asthma in
1973, 1974, and 1975 (actually this is the mean estimate of
the (0,1) indicator variable indicating the presence or
absence of asthma symptoms in 1973, 1974, and 1975),
where $i = 1, 2, \ldots, 12$; $i=1$ indicates Charlotte males, $i=2$
indicates Charlotte females, and so on up to $i=12$ for
California II females. Each $\tilde{y}_{ik}$ is calculated via expression
(6.4.1) of the previous section, and the covariance matrix
$V_{\tilde{y}_i}$ is calculated directly from (6.4.3). Consequently,
forming $\tilde{y} = (\tilde{y}_1', \tilde{y}_2', \ldots, \tilde{y}_{12}')'$, the variation among the
estimates of the proportions with 1+ asthma can be modeled
with

$$E_A\{\tilde{y}\} = \mu = X\beta$$

where $\mu$ is the expected value of $\tilde{y}$, $X$ is the specification
matrix of interest and $\beta$ is the unknown parameter vector.
The estimated covariance matrix for $\tilde{y}$ is written $V_{\tilde{y}}$ and is
the block diagonal matrix with the $i$-th block taken from
expression (6.4.3). The first model of interest is the
identity model, which can be formally stated:

$$E_A\{\tilde{y}\} = X\beta = I\beta = \beta$$

Linear hypotheses concerning the elements of $\beta$ were

subsequently tested via Wald statistics. The results of

such testing are displayed in Table 6.5.2. There is no

three-way interaction, and the only significant (α = .05 level

of significance) two-way interaction is that of area and

sex. For this test, Qc = 12.17 (d.f.=5) with a p-value

of .03.

These results indicate that a promising model

structure to investigate is the one in which predicted

reference values are used for each area, incremental effects

for females are used for each area, and effects for 1974 and

1975 apply to all. This model can be stated:

$$E_A\{\tilde{y}\} = \mu = X_2\beta$$

and the resulting parameter estimates and standard errors

are displayed in Table 6.5.3, along with the specification

matrix. The Wald goodness-of-fit statistic for this model

is Qw = 14.04 (d.f.=22), which has a p-value of .90.

The table also includes the results of linear hypotheses

concerning the elements of the parameter vector $\beta$. The test

Table 6.5.2

Hypotheses and Resulting Test Statistics Concerning Proportions of 1+ Asthma Reported in 1973, 1974 and 1975

Hypothesis

1. No difference between sexes for averages over area*time
2. No variation among areas for averages over sex*time
3. No variation among times for averages over sex*area
4. Homogeneity across areas of difference between sexes for average across time
5. Homogeneity across time of differences between sexes for averages across area
6. Homogeneity across area for difference between time for averages across sex
7. No sex*time*area interaction

Hypothesis       Qc     P-value     df
1.            31.07      .0000       1
2.            32.77      .0000       5
3.            24.42      .0000       2
4.            12.17      .0325       5
5.             0.27      .8700       2
6.             8.55      .5753      10
7.             5.12      .8830      10

Table 6.5.3

Specification Matrix, Estimated Parameters and Standard Errors for Model X2 for the Probability of Reporting Asthma in 1973, 1974 and 1975

1 0  0 0  0 0  0 0  0 0  0 0  0 0
1 0  0 0  0 0  0 0  0 0  0 0  1 0
1 0  0 0  0 0  0 0  0 0  0 0  0 1
1 1  0 0  0 0  0 0  0 0  0 0  0 0
1 1  0 0  0 0  0 0  0 0  0 0  1 0
1 1  0 0  0 0  0 0  0 0  0 0  0 1
0 0  1 0  0 0  0 0  0 0  0 0  0 0
0 0  1 0  0 0  0 0  0 0  0 0  1 0
0 0  1 0  0 0  0 0  0 0  0 0  0 1
0 0  1 1  0 0  0 0  0 0  0 0  0 0
0 0  1 1  0 0  0 0  0 0  0 0  1 0
0 0  1 1  0 0  0 0  0 0  0 0  0 1
0 0  0 0  1 0  0 0  0 0  0 0  0 0
0 0  0 0  1 0  0 0  0 0  0 0  1 0
0 0  0 0  1 0  0 0  0 0  0 0  0 1
0 0  0 0  1 1  0 0  0 0  0 0  0 0
0 0  0 0  1 1  0 0  0 0  0 0  1 0
0 0  0 0  1 1  0 0  0 0  0 0  0 1
0 0  0 0  0 0  1 0  0 0  0 0  0 0
0 0  0 0  0 0  1 0  0 0  0 0  1 0
0 0  0 0  0 0  1 0  0 0  0 0  0 1
0 0  0 0  0 0  1 1  0 0  0 0  0 0
0 0  0 0  0 0  1 1  0 0  0 0  1 0
0 0  0 0  0 0  1 1  0 0  0 0  0 1
0 0  0 0  0 0  0 0  1 0  0 0  0 0
0 0  0 0  0 0  0 0  1 0  0 0  1 0
0 0  0 0  0 0  0 0  1 0  0 0  0 1
0 0  0 0  0 0  0 0  1 1  0 0  0 0
0 0  0 0  0 0  0 0  1 1  0 0  1 0
0 0  0 0  0 0  0 0  1 1  0 0  0 1
0 0  0 0  0 0  0 0  0 0  1 0  0 0
0 0  0 0  0 0  0 0  0 0  1 0  1 0
0 0  0 0  0 0  0 0  0 0  1 0  0 1
0 0  0 0  0 0  0 0  0 0  1 1  0 0
0 0  0 0  0 0  0 0  0 0  1 1  1 0
0 0  0 0  0 0  0 0  0 0  1 1  0 1

Estimates and Standard Errors

β1  :  .0589 ± .0062
β2  : -.0123 ± .0083
β3  :  .0620 ± .0044
β4  : -.0111 ± .0054
β5  :  .0525 ± .0074
β6  : -.0092 ± .0097
β7  :  .0506 ± .0057
β8  : -.0110 ± .0080
β9  :  .0854 ± .0071
β10 : -.0298 ± .0090
β11 :  .0968 ± .0085
β12 : -.0466 ± .0110
β13 : -.0026 ± .0027
β14 :  .0139 ± .0030

β1, β3, β5, β7, β9, β11: Predicted values for 1973 for Charlotte, Birmingham, NYC, Utah, California I and II
β2, β4, β6, β8, β10, β12: Incremental effect for females for Charlotte, Birmingham, NYC, Utah, California I and II
β13: Incremental effect for 1974
β14: Incremental effect for 1975

Linear Hypothesis               Qc     P-value    df

H1: β2 = β4 = β6 = β8         .059      .962       3
H2: β10 = β12                1.101      .294       1
H3: β13 = 0                   .915      .339       1

investigating whether the time effect for 1974 is null,
$H_0\colon \beta_{13} = 0$, can be assessed with the Wald statistic
$Q_C = b'C'(C V_b C')^{-1} C b$, where $b$ is the estimate for $\beta$, $V_b$ is its
estimated covariance matrix, and $C$ is the contrast matrix

$$C = [\, 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; 1 \;\; 0 \,]$$

For the hypothesis H3, $Q_C$ = .915 (1 d.f.) and its p-value is
.34. Additional hypotheses examined were

$$H_1\colon \beta_2 = \beta_4 = \beta_6 = \beta_8$$

or the equivalence of the sex effects for Charlotte,
Birmingham, NYC and Utah, and

$$H_2\colon \beta_{10} = \beta_{12}$$

the equivalence of the sex effects for California I and
California II. Neither of these hypotheses is contradicted, and
the model X3 incorporating the implied parameter smoothing
is displayed in Table 6.5.4.
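
For a contrast matrix with a single row, the Wald statistic reduces to a squared contrast divided by its estimated variance. The sketch below applies this to the 1974 effect using the estimates and standard errors printed in Table 6.5.3. It is an illustration under a stated assumption: the off-diagonal elements of the covariance matrix are not printed, so they are set to zero here, which is harmless only because this particular contrast selects a single parameter. The function name is invented for the example.

```python
import math


def wald_single_contrast(b, v_b, c):
    """Q_C = (c'b)^2 / (c' V_b c) for a one-row contrast c, with its 1 d.f. p-value."""
    cb = sum(ci * bi for ci, bi in zip(c, b))
    cvc = sum(c[i] * v_b[i][j] * c[j] for i in range(len(c)) for j in range(len(c)))
    q_c = cb * cb / cvc
    p_value = math.erfc(math.sqrt(q_c / 2.0))  # upper tail of chi-square, 1 d.f.
    return q_c, p_value


# Estimates and standard errors as printed in Table 6.5.3.
b = [.0589, -.0123, .0620, -.0111, .0525, -.0092, .0506, -.0110,
     .0854, -.0298, .0968, -.0466, -.0026, .0139]
se = [.0062, .0083, .0044, .0054, .0074, .0097, .0057, .0080,
      .0071, .0090, .0085, .0110, .0027, .0030]
# Assumption: unknown off-diagonal covariances are set to zero for illustration.
v_b = [[se[i] ** 2 if i == j else 0.0 for j in range(14)] for i in range(14)]
c = [0] * 12 + [1, 0]  # selects beta_13, the incremental effect for 1974

q_c, p_value = wald_single_contrast(b, v_b, c)
```

Computed from the rounded table entries, the statistic comes out near the reported .915; a small discrepancy is to be expected because the published estimate and standard error are themselves rounded. Hypotheses with several rows, such as H1, would need the full covariance matrix and the general matrix form of the statistic.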

This nine parameter model includes the six area

predicted reference values, one incremental effect for sex

for California I and California II, another incremental sex

effect for the other four areas, and an overall incremental

effect for 1975. The goodness-of-fit statistic for the
model is $Q_W$ = 16.45 (d.f.=27) and its p-value is .94. The
linear hypotheses concerning the area reference parameters
can be stated:

$$H_4\colon \beta_1 = \beta_2, \quad \beta_3 = \beta_4, \quad \beta_5 = \beta_6$$

and were tested with a C matrix of the form

$$C = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \end{bmatrix}$$

and found to be reasonable ($Q_C$ = .73, d.f.=3, p-value=.867).

Table 6.5.4

Specification Matrix, Estimated Parameters and Standard Errors for Model X3 for the Probability of Reporting Asthma in 1973, 1974 and 1975

1 0 0 0 0 0  0 0  0
1 0 0 0 0 0  0 0  0
1 0 0 0 0 0  0 0  1
1 0 0 0 0 0  1 0  0
1 0 0 0 0 0  1 0  0
1 0 0 0 0 0  1 0  1
0 1 0 0 0 0  0 0  0
0 1 0 0 0 0  0 0  0
0 1 0 0 0 0  0 0  1
0 1 0 0 0 0  1 0  0
0 1 0 0 0 0  1 0  0
0 1 0 0 0 0  1 0  1
0 0 1 0 0 0  0 0  0
0 0 1 0 0 0  0 0  0
0 0 1 0 0 0  0 0  1
0 0 1 0 0 0  1 0  0
0 0 1 0 0 0  1 0  0
0 0 1 0 0 0  1 0  1
0 0 0 1 0 0  0 0  0
0 0 0 1 0 0  0 0  0
0 0 0 1 0 0  0 0  1
0 0 0 1 0 0  1 0  0
0 0 0 1 0 0  1 0  0
0 0 0 1 0 0  1 0  1
0 0 0 0 1 0  0 0  0
0 0 0 0 1 0  0 0  0
0 0 0 0 1 0  0 0  1
0 0 0 0 1 0  0 1  0
0 0 0 0 1 0  0 1  0
0 0 0 0 1 0  0 1  1
0 0 0 0 0 1  0 0  0
0 0 0 0 0 1  0 0  0
0 0 0 0 0 1  0 0  1
0 0 0 0 0 1  0 1  0
0 0 0 0 0 1  0 1  0
0 0 0 0 0 1  0 1  1

β1 : Predicted value for 1973 for Charlotte                       .0570 ± .0046
β2 : Predicted value for 1973 for Birmingham                      .0605 ± .0034
β3 : Predicted value for 1973 for NYC                             .0518 ± .0053
β4 : Predicted value for 1973 for Utah                            .0492 ± .0044
β5 : Predicted value for 1973 for Cal I                           .0881 ± .0061
β6 : Predicted value for 1973 for Cal II                          .0896 ± .0067
β7 : Incremental effect for females for Utah,
     Birmingham, NYC and Charlotte                               -.0111 ± .0037
β8 : Incr. effect for females for Cal I & II                     -.0366 ± .0069
β9 : Incremental effect for 1975                                  .0156 ± .0024

Qw = 16.45    p-value = .9438    d.f. = 27

Linear Hypotheses                  Q      p-value    d.f.
β1 = β2, β3 = β4, β5 = β6        .73       .867        3

The constraint imposed by H4 was incorporated into a
final model. This can be stated as:

$$E_A\{\tilde{y}\} = X_4\beta$$

where $X_4$ is the (36 x 6) specification matrix displayed in
Table 6.5.5, and $\beta$ is the (6 x 1) parameter vector, the
estimates for which are also shown in Table 6.5.5, along
with their standard errors. The Wald goodness-of-fit
statistic for this model is $Q_W$ = 17.18 (d.f. = 30,
p-value = .97). There is a noticeable difference in the values
of the area predicted reference values for 1+ asthma in
1973. The value of .0887 for the two California areas is
much higher than the other estimated values of .0594 and
.0502. The incremental effects for females are both
negative in direction, with the effect for California being
much larger in magnitude than that for the other areas and
thus serving to bring the female estimates of 1+ asthma in
California more in line with those in the other areas. The
remaining parameter is the average incremental effect for
1975. Predicted and observed proportions are displayed in

Table 6.5.5

Specification Matrix, Estimated Parameters and Standard Errors for Final Model X4 for the Probability of Reporting Asthma in 1973, 1974 and 1975

1 0 0  0 0  0
1 0 0  0 0  0
1 0 0  0 0  1
1 0 0  1 0  0
1 0 0  1 0  0
1 0 0  1 0  1
1 0 0  0 0  0
1 0 0  0 0  0
1 0 0  0 0  1
1 0 0  1 0  0
1 0 0  1 0  0
1 0 0  1 0  1
0 1 0  0 0  0
0 1 0  0 0  0
0 1 0  0 0  1
0 1 0  1 0  0
0 1 0  1 0  0
0 1 0  1 0  1
0 1 0  0 0  0
0 1 0  0 0  0
0 1 0  0 0  1
0 1 0  1 0  0
0 1 0  1 0  0
0 1 0  1 0  1
0 0 1  0 0  0
0 0 1  0 0  0
0 0 1  0 0  1
0 0 1  0 1  0
0 0 1  0 1  0
0 0 1  0 1  1
0 0 1  0 0  0
0 0 1  0 0  0
0 0 1  0 0  1
0 0 1  0 1  0
0 0 1  0 1  0
0 0 1  0 1  1

β1 : Predicted value for Birm and Char                   .0594 ± .0031
β2 : Predicted value for NYC and Utah                    .0502 ± .0036
β3 : Predicted value for California I and II             .0887 ± .0054
β4 : Incremental effect for females                     -.0110 ± .0037
β5 : Incr. effect for females for Cal I & II            -.0366 ± .0069
β6 : Incremental effect for 1975                         .0156 ± .0024

Qw = 17.18    P-value = .9704    d.f. = 30

Table 6.5.6, along with the corresponding standard errors.
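
The predicted proportions follow directly from the final model as the product of the specification matrix and the estimated parameter vector. A minimal sketch, using the parameter estimates printed in Table 6.5.5 and row-building logic inferred from that table's parameter descriptions (the names `AREA_GROUP`, `design_row` and `predicted` are introduced here for illustration):

```python
AREA_GROUP = {  # column of the 1973 reference parameter for each area
    "Charlotte": 0, "Birmingham": 0,
    "NYC": 1, "Utah": 1,
    "California I": 2, "California II": 2,
}
BETA = [0.0594, 0.0502, 0.0887, -0.0110, -0.0366, 0.0156]  # Table 6.5.5


def design_row(area, sex, year):
    """Row of the final specification matrix for one (area, sex, year) cell."""
    row = [0] * 6
    row[AREA_GROUP[area]] = 1
    if sex == "F":
        # The two California areas share their own female effect (column 4);
        # the other four areas share column 3.
        row[4 if AREA_GROUP[area] == 2 else 3] = 1
    if year == 1975:
        row[5] = 1  # single incremental effect for 1975; none for 1974
    return row


def predicted(area, sex, year):
    return sum(x * b for x, b in zip(design_row(area, sex, year), BETA))


predictions = {
    (area, sex, year): round(predicted(area, sex, year), 4)
    for area in AREA_GROUP
    for sex in ("M", "F")
    for year in (1973, 1974, 1975)
}
```

Because the printed parameter estimates are rounded to four decimals, values computed this way can differ from the published predicted proportions in the last digit.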

When this model is compared to the model for 1+ asthma

discussed in Chapter V and illustrated in Table 6.5.5,

certain similarities emerge. The parameters for California

I and California II can be combined, a situation that has

not occurred for any of the analyses for colds. Also, there

is no inclusion of time effects for 1974 in the model for

the complete data analysis, and the 1975 effect included is

averaged over all areas as in the present analysis, although

it is restricted to males. Thus, even though the current

analysis is based on roughly twice as many observations per

year's estimates, there are similar patterns in each

analysis.

Table 6.5.6

Observed and Predicted Proportions of Reporting Asthma in 1973, 1974 and 1975

                              Observed                  Predicted

Area            Sex     1973    1974    1975      1973    1974    1975

Charlotte        M     .0556   .0600   .0734     .0594   .0594   .0750
Charlotte        F     .0432   .0480   .0609     .0484   .0484   .0640
Birmingham       M     .0663   .0567   .0753     .0594   .0594   .0750
Birmingham       F     .0496   .0471   .0682     .0484   .0484   .0640
NYC              M     .0539   .0567   .0588     .0502   .0502   .0658
NYC              F     .0493   .0441   .0495     .0392   .0392   .0548
Utah             M     .0499   .0506   .0628     .0502   .0502   .0658
Utah             F     .0444   .0338   .0526     .0392   .0392   .0548
California I     M     .0859   .0745   .0110     .0887   .0887   .1043
California I     F     .0534   .0554   .0701     .0522   .0522   .0677
California II    M     .0932   .1009   .0110     .0887   .0887   .1043
California II    F     .0492   .0580   .0585     .0522   .0522   .0677

Standard Errors

                              Observed                  Predicted

Charlotte        M     .0090   .0094   .0089     .0031   .0031   .0034
Charlotte        F     .0078   .0084   .0081     .0029   .0029   .0033
Birmingham       M     .0063   .0052   .0059     .0031   .0031   .0034
Birmingham       F     .0057   .0049   .0057     .0029   .0029   .0034
NYC              M     .0131   .0099   .0104     .0036   .0036   .0039
NYC              F     .0124   .0092   .0101     .0037   .0037   .0040
Utah             M     .0077   .0078   .0086     .0036   .0036   .0039
Utah             F     .0072   .0065   .0081     .0037   .0037   .0040
California I     M     .0089   .0086   .0096     .0054   .0054   .0056
California I     F     .0074   .0078   .0082     .0044   .0044   .0047
California II    M     .0108   .0116   .0117     .0054   .0054   .0056
California II    F     .0086   .0098   .0096     .0044   .0044   .0047

CHAPTER VII

DISCUSSION

This dissertation has attempted to illustrate an

integrated approach to the analysis of categorical data,

especially when the dataset is large and the analytic

objectives ambiguous in terms of pre-stated hypotheses. The

need for variable selection is especially important when the

dataset at hand is large, as it can be physically and

computationally impossible to include all possible variables

in all desired analyses. The variable selection procedure

described in Chapter IV is very suitable for use with a

large dataset with a large number of candidate variables;

the extension of this procedure to include multivariate

randomization statistics to evaluate associations of

candidate variables with multiple response variables is

useful in a repeated measurements situation when there is a

desire to continue analysis efforts with multivariate linear

models methods. The variable selection performed in Chapter

IV was strictly mechanical; certainly the analyst may have

substantive information which may make it reasonable to

include one or more explanatory variables in subsequent

analyses regardless of the results of any variable selection

scheme. For example, even though the two level variable

RACE2 emerged as a variable to 'include' from the variable


selection for the outcome variable 2+ colds, one may have

thought it more prudent to use the three level variable

RACE3 to investigate race effects. However, for those

situations where one has no a priori ideas on which

variables to include, the scheme presented in Chapter IV

would appear to be very reasonable. In linear models

analysis of categorical data, as has been pointed out in

previous chapters, one is limited in the number of

independent variables one can model at one time, so a good

variable selection process is perhaps more important here

than in other data analysis situations where the outcome

measures are not categorical.

The second phase of analysis is the development of

linear models to describe the variation among the estimates

of interest for the subpopulations formed on the basis of

the independent variables selected in the first phase.

Weighted least squares regression is applied to produce

parameter estimates for the models under consideration. The

first model applied is the identity model, which provides

the framework for assessing linear hypotheses concerning

potential sources of variation. Once these preliminary

sources of variation are assessed, one can select an

appropriate model structure for further modeling efforts.

Once such an intermediate model is fitted and assessed as

appropriate, additional model reduction can continue through

the process of hypothesis testing and implied model fitting.

This thesis demonstrated a few directions that linear

models can take. One of these is producing parameter

estimates for one model, and then subsequently fitting

another model to those parameter estimates, which was done

in the section on supplemental margins. The section on

supplemental margins also demonstrated that estimates and

covariance matrices from analyses performed on separate

subpopulations can be put together so that another dimension

of modeling can be pursued. The usefulness of residual

analysis in categorical data analysis is also illustrated in

some of the analysis examples.
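
The weighted least squares steps summarized above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the software used in this work: the function names (wls_fit, wald_test, stack_independent) are hypothetical, the function vector F and its covariance matrix V are taken as already estimated, and the block-diagonal stacking assumes that the separate subpopulation analyses are statistically independent.

```python
import numpy as np


def wls_fit(F, V, X):
    """Weighted least squares fit of the linear model F = X b.

    F : estimated function vector, V : its estimated covariance matrix,
    X : design matrix. Returns the parameter estimates, their covariance
    matrix, the residual goodness-of-fit statistic Q, and its degrees of freedom.
    """
    F = np.asarray(F, dtype=float)
    V = np.asarray(V, dtype=float)
    X = np.asarray(X, dtype=float)
    Vinv_X = np.linalg.solve(V, X)            # V^{-1} X
    cov_b = np.linalg.inv(X.T @ Vinv_X)       # (X' V^{-1} X)^{-1}
    b = cov_b @ (Vinv_X.T @ F)                # (X' V^{-1} X)^{-1} X' V^{-1} F
    resid = F - X @ b
    Q = float(resid @ np.linalg.solve(V, resid))
    df = F.size - X.shape[1]
    return b, cov_b, Q, df


def wald_test(b, cov_b, C):
    """Wald statistic for the linear hypothesis C b = 0 and its degrees of freedom."""
    C = np.atleast_2d(np.asarray(C, dtype=float))
    Cb = C @ b
    Q_C = float(Cb @ np.linalg.solve(C @ cov_b @ C.T, Cb))
    return Q_C, C.shape[0]


def stack_independent(estimates, covariances):
    """Stack estimates from separate analyses into one function vector.

    Assumes the analyses are independent, so the joint covariance
    matrix is block diagonal.
    """
    F = np.concatenate([np.atleast_1d(np.asarray(e, dtype=float)) for e in estimates])
    V = np.zeros((F.size, F.size))
    start = 0
    for c in covariances:
        c = np.atleast_2d(np.asarray(c, dtype=float))
        k = c.shape[0]
        V[start:start + k, start:start + k] = c
        start += k
    return F, V
```

With the identity model (X equal to the identity matrix) the fit reproduces the function vector itself, and linear hypotheses about sources of variation are then assessed by passing contrast matrices to wald_test. Estimates and covariance matrices from a first-stage fit, or from separate subpopulations via stack_independent, can be passed back to wls_fit as a new F and V to fit a further model.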

It is important to note that all of the analysis

performed in this work served to provide an appropriate

description of the variation in the CHESS dataset. The

linear models produced were intended as a descriptive tool

rather than an inferential device. Thus, the multiple

hypothesis tests performed at all the modeling stages also

need to be seen in the context of a descriptive analysis

intended to provide a good description of the data

themselves rather than as tests of statistical inference.

The significance levels adhered to throughout the analysis

chapters were used as guidelines for decisions regarding the

directions of modeling efforts. Resulting models should

thus be thought of as applying to this dataset only; there

is no basis for inferring results to any other population.

Thus, in terms of the discussion at the beginning of Chapter

II, this is strictly a local population analysis.

The analysis results appear to indicate a few

consistent patterns which hold across the individual

analyses of the varying data subsets examined. Males

reported more asthma than females, while females

consistently reported more colds than males. There were

always area differences, although the nature of those

differences varied. Time was a source of variation as well,

although it usually manifested itself as an interaction with

either sex or area. The pollution index was not included in

the linear models efforts, as it did not prove to be

associated with either response variables or explanatory

variables when randomization tests were performed in the

variable selection process. Thus, the possible effects due

to pollution may be part of the reason for area differences.

The missing data strategies applied to this dataset

were functional and useful, but mechanically difficult.

Supplemental margins has its advantages over multivariate

ratio estimation in that it results in weighted estimates

with reduced variances, and also allows one to get

goodness-of-fit tests along the way, but it was very time-consuming

to implement and involved several different stages. Even

more stages are required if one feels the necessity of

element-wise deletion and case-wise deletion discussed in

the supplemental margins section. Multivariate ratio

estimation is easier to apply but its requirement of

indicator functions to denote whether a value is present or

not for a particular response measure reduces by one half

the maximum number of functions that one could include in

the function vector being modeled. Thus, the analysis of

mean colds in section 6.4 had to be split up into separate

analyses for the sexes. As pointed out in Chapter VI,

multivariate ratio estimation was entirely adequate for use

as a missing data adjuster in this dataset, as the estimates

produced by the two procedures for section 6.3 were nearly

identical, the only real difference being that the variances

were somewhat lower for supplemental margins. However, that

might not be the case in a more moderately-sized dataset.
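
The role of the indicator functions in multivariate ratio estimation can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions rather than the procedure as implemented for this dataset: missing values are coded as NaN, the function name ratio_means is hypothetical, and the covariance matrix of the ratios is obtained by a direct first-order (delta method) linearization. It shows why k response measures require 2k functions: each response contributes one numerator mean and one indicator mean.

```python
import numpy as np


def ratio_means(y):
    """Ratio estimates of means for responses with missing values.

    y : (n, k) array with np.nan marking a missing response.
    Returns the k ratio means and their linearized covariance matrix.
    """
    y = np.asarray(y, dtype=float)
    n, k = y.shape
    d = (~np.isnan(y)).astype(float)          # indicators: 1 if present, 0 if missing
    filled = np.where(np.isnan(y), 0.0, y)    # missing responses contribute 0 to numerators
    z = np.hstack([filled, d])                # 2k functions per subject
    zbar = z.mean(axis=0)
    Vz = np.cov(z, rowvar=False) / n          # covariance matrix of the 2k means
    num, den = zbar[:k], zbar[k:]
    if np.any(den == 0):
        raise ValueError("a response measure has no observed values")
    R = num / den                             # mean over the observed values only
    # Jacobian of R with respect to the (numerator, denominator) means
    H = np.hstack([np.diag(1.0 / den), np.diag(-num / den ** 2)])
    VR = H @ Vz @ H.T
    return R, VR
```

Each ratio equals the mean of the observed values for that response, and the resulting vector and covariance matrix can then serve as input to a weighted least squares analysis.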

Additional work is needed to provide software to

automate these missing data adjustment procedures for

categorical data, and to increase their applicability in

terms of numbers of functions they are able to handle. It

would be interesting to research more fully the implications

of using supplemental margins versus multivariate ratio

estimation, i.e., is there a point in the sample sizes

involved where the reduced precision offered by multivariate

ratio estimation becomes a problem? It would have become

computationally impossible to duplicate this analysis if

there were four time points involved instead of three. Easily

implemented software which would allow the presence of more

functions is also thus required from the point of view of

repeated measurements.

BIBLIOGRAPHY

Bates, P. V. (1967). Air pollution and chronic bronchitis. Archives of Environmental Health, 14, 220.

Bhapkar, V. P. (1966). A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 61, 228-235.

Buechley, R. W., Riggan, W. B., Hasselblad, V. and Van Bruggen, J. B. (1973). SO2 levels and perturbations in mortality. A Study in the New York-New Jersey Metropolis. Archives of Environmental Health, 27, 134.

Chapman, R. S., Shy, C. M., Finlea, J. F., House, D. E., Goldberg, H. E. and Hayes, C. G. (1973). Chronic respiratory disease in military inductees and parents of schoolchildren. Archives of Environmental Health, 27, 138.

Clarke, S. H. and Koch, G. G. (1976). The effect of income and other factors on whether criminal defendants go to prison. The Law and Society Review, 11, 57-92.

Cochran, W. G. (1954). Some methods of strengthening the common χ² test. Biometrics, 10, 417-451.

Cook, N. R. and Ware, J. H. (1980). Design and analysis methods for longitudinal research. Annual Review of Public Health, 4, 1-24.

Cornfield, J. (1944). On samples from finite populations. Journal of the American Statistical Association, 39, 136-239.

Dohan, F. C. (1961). Air pollutants and incidence of respiratory disease. Archives of Environmental Health, 3, 387.

Dohan, F. C. and Taylor, E. W. (1960). Air pollution and respiratory disease, a preliminary report. American Journal of Medical Science, 240, 337.

Douglas, J. and Waller, R. (1966). Air pollution and respiratory disease in children. British Journal of Preventative Social Medicine, 20, 1-8.

Ferris, B. (1970). Effects of air pollution on school absences and differences in lung function in first and second graders in Berlin, New Hampshire, January 1966 to June 1967. American Review of Respiratory Disease, 102, 591-607.

Forthofer, R. N. and Koch, G. G. (1973). An analysis for compounded functions of categorical data. Biometrics, 29, 143-157.

Gleason, T. C. and Staelin, R. (1975). A proposal for handling missing data. Psychometrika, 40, 229-252.

Glasser, M., Greenberg, L. and Field, F. (1976). Mortality and morbidity during a period of high levels of air pollution. New York, November 23 to 25, 1965. Archives of Environmental Health, 15, 684.

Gould, A. L. (1980). A new approach to the analysis of clinical drug trials with withdrawals. Biometrics, 36, 721-727.

Greenberg, L., Field, F., Reed, J. I. and Erhardt, C. L. (1964). Asthma and temperature change. An epidemiological study in three large New York hospitals. Archives of Environmental Health, 8, 642.

Greenhouse, S. W. and Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 112.

Grizzle, J. E., Starmer, C. F. and Koch, G. G. (1969). The analysis of categorical data by linear models. Biometrics, 25, 489-504.

Hasselblad, V., Nelson, W. C. and Lowrimore, G. R. (1974). Analysis of Effects Data: Some Results and Problems. In Statistical and Mathematical Aspects of Pollution Problems, ed. John N. Pratt. Marcel Dekker, Inc.

Herman, Stewart W. (1977). The health costs of air pollution: A survey of studies published between 1967 and 1977. American Lung Association, 1740 Broadway, New York, N.Y. 1001~

Higgins, J. E. and Koch, G. G. (1977). Variable selection and generalized chi-squared analysis of categorical data applied to a large cross-sectional occupational health survey. International Statistical Review, 45, 51-62.

Hopkins, C. E. and Gross, A. J. (1971). A generalization of Cochran's procedure for the combining of r x c contingency tables. Statistica Neerlandica, 25, 57-62.

Huynh, H. and Feldt, L. S. (1976). Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics, 1, 69-82.

Kleinbaum, D. G. (1970). Estimation and hypothesis testing for generalized multivariate linear models. University of North Carolina Institute of Statistics Mimeo Series, No. 609.

Koch, G. G. (1969). A useful lemma for proving the equality of two matrices with applications to least squares type quadratic forms. Journal of the American Statistical Association, 64, 969-970.

Koch, G. G., Amara, I. A., Stokes, M. E. and Gillings, D. B. (1980). Some views on some parametric and nonparametric analysis for repeated measurements and selected bibliography. International Statistical Review, 48, 249-265.

Koch, G. G., Elashoff, J. and Amara, I. A. (1985). Repeated Measurement Studies, Design and Analysis. In Encyclopedia of Statistical Sciences, N. L. Johnson and S. Kotz, eds. 457-472, Wiley, New York.

Koch, G. G. and Gillings, D. B. (1983). Inference, design based vs. model based. In Encyclopedia of Statistical Sciences 4, N. L. Johnson and S. Kotz (eds.) 84-88, Wiley, New York.

Koch, G. G., Gillings, D. B. and Stokes, M. E. (1980). Biostatistical implications of design, sampling and measurement to the analysis of health science data. Annual Review of Public Health, 1, 163-225.

Koch, G. G., Imrey, P. B. and Reinfurt, D. W. (1972). Linear model analysis of categorical data with incomplete response vectors. Biometrics, 28, 663-692.

Koch, G. G., Imrey, P. B., Singer, J. M., Atkinson, S. S. and Stokes, M. E. (1985). Analysis of Categorical Data. University of Montreal Press.

Koch, G. G., Johnson, W. and Tolley, D. (1972). A linear models approach to the analysis of survival and extent of disease in multidimensional contingency tables. Journal of the American Statistical Association, 72, 783-796.

Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H., Jr. and Lehnen, R. G. (1977). A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics, 33, 133-158.

Koch, G. G. and Reinfurt, D. N. (1974). An analysis of the relationship between driver injury and vehicle age for automobiles involved in North Carolina accidents during 1966-1970. Accident Analysis and Prevention, 6, 1-18.

Laird, N. and Ware, J. (1982). Random effects for longitudinal data. Biometrics, 38, 963-974.

Lambert, P. M. and Reid, D. D. (1970). Smoking, air pollution, and bronchitis in Britain. Lancet, 1, 853.

Landis, J. R., Heyman, E. R. and Koch, G. G. (1978). Average partial association in three-way contingency tables: a review and discussion of alternate tests. International Statistical Review, 46, 237-254.

Landis, J. R. and Koch, G. G. (1979). The analysis of categorical data in longitudinal studies of behavioral development. Chapter 9 in Longitudinal Research in the Study of Behavior and Development, edited by J. R. Nesselroade and P. B. Baltes. Academic Press, New York, 233-261.

Landis, J. R., Stanish, W. M., Freeman, J. L. and Koch, G. G. (1976). A computer program for the generalized chi-square analysis of categorical data using weighted least squares (GENCAT). Computer Programs in Biomedicine, 6, 196-231.

Lehnen, R. G. and Koch, G. G. (1974). The analysis of categorical data from repeated measurement research designs. Political Methodology, 1, 103-123.

Levy, D., Gent, M. and Newhouse, M. T. (1977). Relationship between acute respiratory disease and air pollution levels in an industrial city. American Review of Respiratory Disease, 116, 167.

Lunn, J. E., Knowelden, J. and Handyside, A. Patterns of respiratory illness in Sheffield schoolchildren. British Journal of Preventative Medicine, 21, 7-16.

McCarroll, J. and Bradley, W. (1966). Excess mortality as an indicator of health effects of air pollution. American Journal of Public Health, 56, 1933.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.

Mantel, N. (1963). Chi-square tests with one degree of freedom; extensions of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690-700.

Martin, A. E. (1964). Mortality and morbidity statistics and air pollution. Proceedings of the Royal Society of Medicine, 57, 969.

Morrison, D. (1976). Multivariate Statistical Methods. McGraw-Hill Book Company, New York, 2nd Edition.

Mosher, W. E. (1970). An effect of continued exposure to air pollution on the incidence of chronic childhood allergic disease. American Journal of Public Health, 60, 891.

Mostardi, R. A., Woebkenberg, N. R., Ely, D. L., Conlon, M. M. and Atwood, G. (1981). The University of Akron study on air pollution and human health effects II. Effects on acute respiratory illness. Archives of Environmental Health, 36, 5.

Muller, K. E., Smith, J. C. and Shy, C. M. (1981). Relationship between air pollution and children's pulmonary function in six areas in the United States. Contract 68-02-2763 with EPA.

Neyman, J. (1949). Contributions to the theory of the χ² test. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability. J. Neyman (ed.) University of California Press, Berkeley, 239-273.

Puri, M. L. and Sen, P. K. (1971). Non-parametric Methods in Multivariate Analysis. Wiley and Sons, New York.

Rao, M., Steiner, P., Qazi, Q., Padre, R., Allen, J. E. and Steiner, M. (1973). Relationship of air pollution to attack rate of asthma in children. Journal of Asthma Research, 11, 73.

Schenk, H. H., Heimann, H., Clayton, G. D., Gafafer, W. and Wexler, H. (1949). Air pollution in Donora, Pennsylvania. Epidemiology of the unusual smog episode of October 1948, Public Health Bulletin 306, U.S. Government Printing Office, Washington, D.C.

Shy, C., Goldsmith, J., Hackney, J., Lebowitz, M. and Menzel, D. (1978). Health Effects of Air Pollution: Review for the American Lung Association. American Lung Association.

Shy, C. M., Hasselblad, V., Burton, R. M., Nelson, C. J. and Cohen, A. (1973). Air Pollution Effects on Ventilatory Function of U.S. Schoolchildren. Results of Studies in Cincinnati, Chattanooga and New York. Archives of Environmental Health, 27, 124.

Stanish, W. M. (1978). Adjustment for covariates in categorical variable selection and in multivariate partial association tests. Unpublished dissertation, UNC/Chapel Hill.

Stanish, W. M., Gillings, D. B. and Koch, G. G. (1978). An application of multivariate ratio methods for the analysis of a longitudinal clinical trial with missing data. Biometrics, 34, 305-317.

Status of the Community Health and Environmental Surveillance System (CHESS). (November, 1980). Report to the U.S. House of Representatives Committee on Science and Technology. EPA-600/1-80-033. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C. 20460.

Sultz, H. A., Feldman, J. G., Schlesinger, E. R. and Mosher, W. E. (1970). An effect of continued exposure to air pollution on the incidence of chronic childhood allergic disease. American Journal of Public Health, 60, 891.

Timm, N. (1975). Multivariate Methods with Applications in Education and Psychology. Brooks/Cole Publishing Company, Monterey, California.

Timm, N. (1980). Multivariate Analysis of Variance of Repeated Measurements. In P. R. Krishnaiah, ed. Handbook of Statistics, Vol. I. North Holland Publishing Company, 41-87.

Toyama, T. (1964). Air pollution and its health effects in Japan. Archives of Environmental Health, 8, 153.

Verma, M. P., Schilling, F. J. and Becker, N. H. (1969). Epidemiological study of illness absences in relation to air pollution. Archives of Environmental Health, 18, 536.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426-482.

Ware, J. H. (1985). Linear models for the analysis of longitudinal studies. The American Statistician, 39, 2.

Whittemore, A. S. and Korn, E. L. (1980). Asthma and air pollution in the Los Angeles area. American Journal of Public Health, 70, 7.

Yoshida, R., Motomiya, K., Saito, H. and Funabashi, S. (1976). Clinical and epidemiological studies on childhood asthma and air polluted areas. In Clinical Implications of Air Pollution Research, Asher J. Finkel and Ward C. Duel, eds. Publishing Science Group, Inc., Acton, Mass., 165-176.
