Discrete Multivariate Analysis
Analysis of Multivariate Categorical Data
References
1. Fienberg, S. (1980), Analysis of Cross-Classified Data, MIT Press, Cambridge, Mass.
2. Fingleton, B. (1984), Models of Category Counts, Cambridge University Press.
3. Agresti, A. (1990), Categorical Data Analysis, Wiley, New York.
Example 1
Data Set #1 - A two-way frequency table

Serum                Systolic Blood Pressure
Cholesterol    <127   127-146   147-166   167+   Total
<200            117       121        47     22     307
200-219          85        98        43     20     246
220-259         119       209        68     43     439
260+             67        99        46     33     245
Total           388       527       204    118    1237
In this study we examine n = 1237 individuals measuring X, Systolic Blood Pressure and Y, Serum Cholesterol
Example 2
The following data were taken from a study of parole success involving 5587 parolees in Ohio between 1965 and 1972 (a ten percent sample of all parolees during this period).
The study involved a dichotomous response Y, based on a one-year follow-up:
– Success (no major parole violation), or
– Failure (returned to prison either as a technical violator or with a new conviction).
The predictors of parole success included are:
1. Type of committed offense (Person offense or Other offense),
2. Age (25 or older or Under 25),
3. Prior record (No prior sentence or Prior sentence), and
4. Drug or alcohol dependency (No drug or alcohol dependency or Drug and/or alcohol dependency).
• The data were randomly split into two parts. The counts for each part are displayed in the table, with those for the second part in parentheses.
• The second part of the data was set aside for a validation study of the model to be fitted in the first part.
Table (counts for the second part of the data in parentheses)

                      No drug or alcohol dependency       Drug and/or alcohol dependency
                      25 or older      Under 25           25 or older      Under 25
                      Person  Other    Person  Other      Person  Other    Person  Other
No prior sentence of any kind
  Success               48      34       37      49         48      28       35      57
                       (44)    (34)     (29)    (58)       (47)    (38)     (37)    (53)
  Failure                1       5        7      11          3       8        5      18
                        (1)     (7)      (7)     (5)        (1)     (2)      (4)    (24)
Prior sentence
  Success              117     259      131     319        197     435      107     291
                      (111)   (253)    (131)   (320)      (202)   (392)    (103)   (294)
  Failure               23      61       20      89         38     194       27     101
                       (27)    (55)     (25)    (93)       (46)   (215)     (34)   (102)
Multiway Frequency Tables
• Two-Way (variables A and B)
• Three-Way (variables A, B and C)
• Four-Way (variables A, B, C and D)
Analysis of a Two-way Frequency Table:
Frequency Distribution (Serum Cholesterol and Systolic Blood Pressure)

Serum                Systolic Blood Pressure
Cholesterol    <127   127-146   147-166   167+   Total
<200            117       121        47     22     307
200-219          85        98        43     20     246
220-259         119       209        68     43     439
260+             67        99        46     33     245
Total           388       527       204    118    1237

Joint and Marginal Distributions (Serum Cholesterol and Systolic Blood Pressure), in percent

Serum                    Systolic Blood Pressure            Marginal distn
Cholesterol          <127   127-146   147-166    167+      (Serum Chol.)
<200                 9.46      9.78      3.80    1.78          24.82
200-219              6.87      7.92      3.48    1.62          19.89
220-259              9.62     16.90      5.50    3.48          35.49
260+                 5.42      8.00      3.72    2.67          19.81
Marginal distn (BP) 31.37     42.60     16.49    9.54         100.00
The Marginal distributions allow you to look at the effect of one variable, ignoring the other. The joint distribution allows you to look at the two variables simultaneously.
Conditional Distributions (Systolic Blood Pressure given Serum Cholesterol), in percent
The conditional distribution allows you to look at the effect of one variable, when the other variable is held fixed or known.

Serum                    Systolic Blood Pressure
Cholesterol          <127   127-146   147-166    167+     Total
<200                38.11     39.41     15.31    7.17    100.00
200-219             34.55     39.84     17.48    8.13    100.00
220-259             27.11     47.61     15.49    9.79    100.00
260+                27.35     40.41     18.78   13.47    100.00
Marginal distn (BP) 31.37     42.60     16.49    9.54    100.00

Conditional Distributions (Serum Cholesterol given Systolic Blood Pressure), in percent

Serum                    Systolic Blood Pressure            Marginal distn
Cholesterol          <127   127-146   147-166    167+      (Serum Chol.)
<200                30.15     22.96     23.04   18.64          24.82
200-219             21.91     18.60     21.08   16.95          19.89
220-259             30.67     39.66     33.33   36.44          35.49
260+                17.27     18.79     22.55   27.97          19.81
Total              100.00    100.00    100.00  100.00         100.00
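The joint, marginal and conditional percentages above all come from the same table of counts. The following is a minimal Python sketch of that arithmetic; the variable names are illustrative choices, not part of the original slides.

```python
# Observed counts: rows = serum cholesterol, columns = systolic blood pressure.
counts = [
    [117, 121, 47, 22],   # <200
    [85, 98, 43, 20],     # 200-219
    [119, 209, 68, 43],   # 220-259
    [67, 99, 46, 33],     # 260+
]

n = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Joint distribution: each cell as a percentage of the grand total.
joint = [[100 * x / n for x in row] for row in counts]

# Marginal distributions: row and column totals as percentages of the grand total.
marginal_chol = [100 * r / n for r in row_totals]
marginal_bp = [100 * c / n for c in col_totals]

# Blood pressure given cholesterol: divide each cell by its row total.
bp_given_chol = [[100 * x / r for x in row] for row, r in zip(counts, row_totals)]

# Cholesterol given blood pressure: divide each cell by its column total.
chol_given_bp = [[100 * x / c for x, c in zip(row, col_totals)] for row in counts]

for row in bp_given_chol:
    print("  ".join(f"{p:6.2f}" for p in row))
```

Comparing the rows of `bp_given_chol` with `marginal_bp` is the numerical counterpart of comparing the conditional profiles with the marginal distribution.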
GRAPH: Conditional distributions of Systolic Blood Pressure given Serum Cholesterol (one profile for each Serum Cholesterol category, <200, 200-219, 220-259 and 260+, together with the marginal distribution; horizontal axis: Systolic Blood Pressure category; vertical axis: percent, 10% to 50%)
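Notation:
Let $x_{ij}$ denote the frequency (no. of cases) where X (row variable) is i and Y (column variable) is j.

$$x_{i\bullet} = R_i = \sum_{j=1}^{c} x_{ij}$$

$$x_{\bullet j} = C_j = \sum_{i=1}^{r} x_{ij}$$

$$x_{\bullet\bullet} = N = \sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij} = \sum_{i=1}^{r} x_{i\bullet} = \sum_{j=1}^{c} x_{\bullet j}$$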
Different Models

The Multinomial Model: Here the total number of cases N is fixed and the $x_{ij}$ follow a multinomial distribution with parameters $\pi_{ij} = P[X = i, Y = j]$.

$$f(x_{11}, x_{12}, \ldots, x_{rc}) = \frac{N!}{x_{11}!\, x_{12}! \cdots x_{rc}!}\; \pi_{11}^{x_{11}} \pi_{12}^{x_{12}} \cdots \pi_{rc}^{x_{rc}}$$

$$\mu_{ij} = E(x_{ij}) = N \pi_{ij}$$

The Product Multinomial Model: Here the row (or column) totals $R_i$ are fixed and, for a given row i, the $x_{ij}$ follow a multinomial distribution with parameters $\pi_{j|i}$.

$$f(x_{11}, x_{12}, \ldots, x_{rc}) = \prod_{i=1}^{r} \frac{R_i!}{x_{i1}!\, x_{i2}! \cdots x_{ic}!}\; \pi_{1|i}^{x_{i1}} \pi_{2|i}^{x_{i2}} \cdots \pi_{c|i}^{x_{ic}}$$

$$\mu_{ij} = E(x_{ij}) = R_i \pi_{j|i}$$

The Poisson Model: In this case we observe over a fixed period of time and all counts in the table (including row, column and overall totals) follow a Poisson distribution. Let $\mu_{ij}$ denote the mean of $x_{ij}$.

$$f(x_{ij}) = \frac{\mu_{ij}^{x_{ij}}}{x_{ij}!}\, e^{-\mu_{ij}}, \qquad E(x_{ij}) = \mu_{ij}$$

$$f(x_{11}, x_{12}, \ldots, x_{rc}) = \prod_{i=1}^{r}\prod_{j=1}^{c} \frac{\mu_{ij}^{x_{ij}}}{x_{ij}!}\, e^{-\mu_{ij}}$$
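Independence

Multinomial Model: if X and Y are independent then

$$\pi_{ij} = P[X = i, Y = j] = P[X = i]\,P[Y = j] = \pi_{i\bullet}\,\pi_{\bullet j}$$

and

$$\mu_{ij} = N\pi_{ij} = N\,\pi_{i\bullet}\,\pi_{\bullet j}$$

The estimated expected frequency in cell (i,j) in the case of independence is:

$$m_{ij} = \hat{\mu}_{ij} = N\,\hat{\pi}_{i\bullet}\,\hat{\pi}_{\bullet j} = N\,\frac{x_{i\bullet}}{N}\,\frac{x_{\bullet j}}{N} = \frac{x_{i\bullet}\,x_{\bullet j}}{N} = \frac{R_i C_j}{N}$$

The same can be shown for the other two models, the Product Multinomial model and the Poisson model. Namely, the estimated expected frequency in cell (i,j) in the case of independence is:

$$m_{ij} = \hat{\mu}_{ij} = \frac{R_i C_j}{N} = \frac{x_{i\bullet}\,x_{\bullet j}}{x_{\bullet\bullet}}$$

Standardized residuals are defined for each cell:

$$r_{ij} = \frac{x_{ij} - m_{ij}}{\sqrt{m_{ij}}}$$

The Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} r_{ij}^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij} - m_{ij})^2}{m_{ij}}$$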
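The Chi-Square test for independence
Reject H0: independence if

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij} - m_{ij})^2}{m_{ij}} > \chi^2_{\alpha}, \qquad df = (r-1)(c-1)$$

where $\chi^2_{\alpha}$ is the upper-tail critical value of the chi-square distribution with df degrees of freedom.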
Table: Expected frequencies, (Observed frequencies), Standardized Residuals

Serum                Systolic Blood Pressure
Cholesterol      <127    127-146    147-166     167+    Total
<200            96.29     130.79      50.63    29.29      307
                (117)      (121)       (47)     (22)
                 2.11      -0.86      -0.51    -1.35
200-219         77.16     104.80      40.57    23.47      246
                 (85)       (98)       (43)     (20)
                 0.86      -0.66       0.38    -0.72
220-259        137.70     187.03      72.40    41.88      439
                (119)      (209)       (68)     (43)
                -1.59       1.61      -0.52     0.17
260+            76.85     104.38      40.40    23.37      245
                 (67)       (99)       (46)     (33)
                -1.12      -0.53       0.88     1.99
Total             388        527        204      118     1237

χ² = 20.85 (p = 0.0133)
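The expected frequencies, standardized residuals and chi-square statistic in the table above can be recomputed from the observed counts. The sketch below is a minimal pure-Python version; the function names `chi2_sf` and `independence_test` are illustrative, and the p-value helper uses the standard recurrence for the chi-square upper tail with integer degrees of freedom rather than any particular statistics package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df > x), for x >= 0."""
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(df + 2) = Q(df) + (x/2)^(df/2) * exp(-x/2) / Gamma(df/2 + 1).
    if df % 2 == 0:
        q, nu = math.exp(-x / 2), 2
    else:
        q, nu = math.erfc(math.sqrt(x / 2)), 1
    while nu < df:
        q += (x / 2) ** (nu / 2) * math.exp(-x / 2) / math.gamma(nu / 2 + 1)
        nu += 2
    return q

def independence_test(table):
    """Expected counts, standardized residuals, chi-square, df and p-value."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [[(x - m) / math.sqrt(m) for x, m in zip(x_row, m_row)]
                 for x_row, m_row in zip(table, expected)]
    chi2 = sum(r * r for row in residuals for r in row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, residuals, chi2, df, chi2_sf(chi2, df)

# Rows = serum cholesterol, columns = systolic blood pressure.
table = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]
expected, residuals, chi2, df, p_value = independence_test(table)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
```

Large positive residuals, such as the one in the (<200, <127) cell, mark cells with more cases than independence would predict.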
Example
In this example, N = 57,407 cases in which an individual was victimized twice by crime were studied.
The crime of the first victimization (X) and the crime of the second victimization (Y) were noted.
The data are tabulated below.
Table 1: Frequencies (rows: first victimization in pair; columns: second victimization in pair)

           Ra       A      Ro   PP/PS      PL       B      HL      MV    Total
Ra         26      50      11       6      82      39      48      11      273
A          65    2997     238      85    2553    1083    1349     216     8586
Ro         12     279     197      36     459     197     221      47     1448
PP/PS       3     102      40      61     243     115     101      38      703
PL         75    2628     413     229   12137    2658    3689     687    22516
B          52    1117     191     102    2649    3210    1973     301     9595
HL         42    1251     206     117    3757    1962    4646     391    12372
MV          3     221      51      24     678     301     367     269     1914
Total     278    8645    1347     660   22558    9565   12394    1960    57407

Table 2: Expected Frequencies (assuming independence)

            Ra        A       Ro    PP/PS       PL        B       HL       MV    Total
Ra        1.32    41.11     6.41     3.14   107.27    45.49    58.94     9.32      273
A        41.58  1292.98   201.46    98.71  3373.86  1430.58  1853.69   293.14     8586
Ro        7.01   218.06    33.98    16.65   568.99   241.26   312.62    49.44     1448
PP/PS     3.40   105.87    16.50     8.08   276.24   117.13   151.78    24.00      703
PL      109.04  3390.72   528.32   258.86  8847.63  3751.56  4861.14   768.75    22516
B        46.46  1444.92   225.14   110.31  3770.34  1598.69  2071.53   327.59     9595
HL       59.91  1863.12   290.30   142.24  4861.56  2061.39  2671.08   422.41    12372
MV        9.27   288.23    44.91    22.00   752.10   318.91   413.23    65.35     1914
Total      278     8645     1347      660    22558     9565    12394     1960    57407

Table 3: Standardized residuals (rows: first victimization in pair; columns: second victimization in pair)

           Ra       A      Ro   PP/PS      PL       B      HL      MV
Ra       21.5     1.4     1.8     1.6    -2.4    -1.0    -1.9     0.6
A         3.6    47.4     2.6    -1.4   -14.1    -9.2   -11.7    -4.5
Ro        1.9     4.1    28.0     4.7    -4.6    -2.8    -5.2    -0.3
PP/PS    -0.2    -0.4     5.8    18.6    -2.0    -0.2    -4.1     2.9
PL       -3.3   -13.1    -5.0    -1.9    35.0   -17.9   -16.8    -2.9
B         0.8    -8.6    -2.3    -0.8   -18.3    40.3    -2.2    -1.5
HL       -2.3   -14.2    -4.9    -2.1   -15.8    -2.2    38.2    -1.5
MV       -2.1    -4.0     0.9     0.4    -2.7    -1.0    -2.3    25.2

χ² = 11,430 (highly significant)

Table 4: Conditional distribution of second victimization given the first victimization (%)

             Ra       A      Ro   PP/PS      PL       B      HL      MV    Total
Ra          9.5    18.3     4.0     2.2    30.0    14.3    17.6     4.0    100.0
A           0.8    34.9     2.8     1.0    29.7    12.6    15.7     2.5    100.0
Ro          0.8    19.3    13.6     2.5    31.7    13.6    15.3     3.2    100.0
PP/PS       0.4    14.5     5.7     8.7    34.6    16.4    14.4     5.4    100.0
PL          0.3    11.7     1.8     1.0    53.9    11.8    16.4     3.1    100.0
B           0.5    11.6     2.0     1.1    27.6    33.5    20.6     3.1    100.0
HL          0.3    10.1     1.7     0.9    30.4    15.9    37.6     3.2    100.0
MV          0.2    11.5     2.7     1.3    35.4    15.7    19.2    14.1    100.0
Marginal    0.5    15.1     2.3     1.1    39.3    16.7    21.6     3.4    100.0
Log Linear Model
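Recall, if the two variables, rows (X) and columns (Y), are independent then

$$\mu_{ij} = N\pi_{ij} = N\,\pi_{i\bullet}\,\pi_{\bullet j}$$

and

$$\ln \mu_{ij} = \ln N + \ln \pi_{i\bullet} + \ln \pi_{\bullet j}$$

In general let

$$u = \frac{1}{rc}\sum_{i}\sum_{j} \ln \mu_{ij}$$

$$u_{1(i)} = \frac{1}{c}\sum_{j} \ln \mu_{ij} - u$$

$$u_{2(j)} = \frac{1}{r}\sum_{i} \ln \mu_{ij} - u$$

$$u_{12(i,j)} = \ln \mu_{ij} - u - u_{1(i)} - u_{2(j)}$$

then

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)} \qquad (1)$$

where

$$\sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = \sum_{i} u_{12(i,j)} = \sum_{j} u_{12(i,j)} = 0$$

Equation (1) is called the log-linear model for the frequencies $x_{ij}$.
Note: X and Y are independent if

$$u_{12(i,j)} = 0 \text{ for all } i, j$$

In this case the log-linear model becomes

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)}$$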
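The definitions of u, u1(i), u2(j) and u12(i,j) translate directly into row, column and overall averages of ln μij. The sketch below, in which the function name `u_terms` is an illustrative choice, applies them to the blood pressure and cholesterol table twice: once with μij set to the observed counts, and once with μij set to the expected counts under independence, where every interaction term should vanish.

```python
import math

def u_terms(mu):
    """Split ln(mu_ij) into u, u1(i), u2(j), u12(i,j) with sum-to-zero constraints."""
    r, c = len(mu), len(mu[0])
    log_mu = [[math.log(m) for m in row] for row in mu]
    u = sum(sum(row) for row in log_mu) / (r * c)                 # overall mean
    u1 = [sum(row) / c - u for row in log_mu]                     # row effects
    u2 = [sum(log_mu[i][j] for i in range(r)) / r - u for j in range(c)]  # column effects
    u12 = [[log_mu[i][j] - u - u1[i] - u2[j] for j in range(c)] for i in range(r)]
    return u, u1, u2, u12

counts = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]
n = sum(map(sum, counts))
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
independent = [[r * c / n for c in col_totals] for r in row_totals]

u, u1, u2, u12 = u_terms(independent)        # interaction terms are all ~0
_, _, _, u12_obs = u_terms(counts)           # interaction terms are not all 0

print(max(abs(v) for row in u12 for v in row))
print([[round(v, 3) for v in row] for row in u12_obs])
```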
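Comment: The log-linear model for a two-way frequency table:

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)}$$

is similar to the model for a two-factor experiment

$$y_{ijk} = \mu_{ij} + \varepsilon_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

where $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$ is the mean of y when A = i and B = j.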
Three-way Frequency Tables
Example
Data from the Framingham Longitudinal Study of Coronary Heart Disease (Cornfield [1962])
Variables
1. Systolic Blood Pressure (X)
– <127, 127-146, 147-166, 167+
2. Serum Cholesterol
– <200, 200-219, 220-259, 260+
3. Heart Disease
– Present, Absent
The data are tabulated below.
Three-way Frequency Table

Coronary Heart   Serum Cholesterol      Systolic Blood Pressure (mm Hg)
Disease          (mg/100 cc)          <127   127-146   147-166   167+
Present          <200                    2         3         3      4
                 200-219                 3         2         0      3
                 220-259                 8        11         6      6
                 260+                    7        12        11     11
Absent           <200                  117       121        47     22
                 200-219                85        98        43     20
                 220-259               119       209        68     43
                 260+                   67        99        46     33
Log-Linear model for three-way tables
Let $\mu_{ijk}$ denote the expected frequency in cell (i,j,k) of the table; then in general

$$\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$$

where

$$0 = \sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = \sum_{k} u_{3(k)} = \sum_{i} u_{12(i,j)} = \sum_{j} u_{12(i,j)}$$

$$= \sum_{i} u_{13(i,k)} = \sum_{k} u_{13(i,k)} = \sum_{j} u_{23(j,k)} = \sum_{k} u_{23(j,k)}$$

$$= \sum_{i} u_{123(i,j,k)} = \sum_{j} u_{123(i,j,k)} = \sum_{k} u_{123(i,j,k)}$$
Hierarchical Log-linear models for categorical Data
For three way tables
The hierarchical principle: If an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction.

1. Model: (All main effects model) ln μijk = u + u1(i) + u2(j) + u3(k)
i.e. u12(i,j) = u13(i,k) = u23(j,k) = u123(i,j,k) = 0.
Notation: [1][2][3]
Description: Mutual independence between all three variables.

2. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j)
i.e. u13(i,k) = u23(j,k) = u123(i,j,k) = 0.
Notation: [12][3]
Description: Independence of Variable 3 with variables 1 and 2.

3. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u13(i,k)
i.e. u12(i,j) = u23(j,k) = u123(i,j,k) = 0.
Notation: [13][2]
Description: Independence of Variable 2 with variables 1 and 3.

4. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u23(j,k)
i.e. u12(i,j) = u13(i,k) = u123(i,j,k) = 0.
Notation: [23][1]
Description: Independence of Variable 1 with variables 2 and 3.

5. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k)
i.e. u23(j,k) = u123(i,j,k) = 0.
Notation: [12][13]
Description: Conditional independence between variables 2 and 3 given variable 1.

6. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u23(j,k)
i.e. u13(i,k) = u123(i,j,k) = 0.
Notation: [12][23]
Description: Conditional independence between variables 1 and 3 given variable 2.

7. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) + u23(j,k)
i.e. u12(i,j) = u123(i,j,k) = 0.
Notation: [13][23]
Description: Conditional independence between variables 1 and 2 given variable 3.

8. Model: ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k)
i.e. u123(i,j,k) = 0.
Notation: [12][13][23]
Description: Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.

9. Model: (the saturated model) ln μijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k) + u23(j,k) + u123(i,j,k)
Notation: [123]
Description: No simplifying dependence structure.
Hierarchical Log-linear models for 3 way table

Model           Description
[1][2][3]       Mutual independence between all three variables.
[1][23]         Independence of Variable 1 with variables 2 and 3.
[2][13]         Independence of Variable 2 with variables 1 and 3.
[3][12]         Independence of Variable 3 with variables 1 and 2.
[12][13]        Conditional independence between variables 2 and 3 given variable 1.
[12][23]        Conditional independence between variables 1 and 3 given variable 2.
[13][23]        Conditional independence between variables 1 and 2 given variable 3.
[12][13][23]    Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.
[123]           The saturated model.
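The bracket notation lists only the highest-order terms; the hierarchical principle supplies the rest. As a rough illustration, the sketch below expands a bracket specification into all implied u-terms and counts the independent parameters under the sum-to-zero constraints (a term involving variables with a, b, ... levels contributes (a-1)(b-1)... parameters, plus one for u). Subtracting that count from the number of cells gives the degrees of freedom of the fitted model. The function names and the single-character variable labels are assumptions made for this example, and every variable is assumed to appear in at least one bracket.

```python
from itertools import combinations
from math import prod

def model_terms(generators):
    """All u-terms implied by the hierarchical principle, e.g. ["12", "3"]."""
    terms = set()
    for g in generators:
        for size in range(1, len(g) + 1):
            terms.update(combinations(sorted(g), size))
    return terms

def degrees_of_freedom(generators, levels):
    """Number of cells minus number of independent parameters fitted."""
    n_cells = prod(levels.values())
    # One parameter for the constant u, then (levels - 1) products per term.
    n_params = 1 + sum(prod(levels[v] - 1 for v in term)
                       for term in model_terms(generators))
    return n_cells - n_params

# A 4 x 4 x 2 table with variables labelled B, C and H.
levels = {"B": 4, "C": 4, "H": 2}
for spec in (["B", "C", "H"], ["BH", "CH"], ["BC", "BH", "CH"], ["BCH"]):
    print(spec, degrees_of_freedom(spec, levels))
```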
Maximum Likelihood Estimation
Log-Linear Model
For any Model it is possible to determine the maximum Likelihood Estimators of the parameters
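Example: Two-way table, independence, multinomial model

$$f(x_{11}, x_{12}, \ldots, x_{rc}) = \frac{N!}{x_{11}!\, x_{12}! \cdots x_{rc}!}\; \pi_{11}^{x_{11}} \pi_{12}^{x_{12}} \cdots \pi_{rc}^{x_{rc}} = \frac{N!}{x_{11}!\, x_{12}! \cdots x_{rc}!} \left(\frac{\mu_{11}}{N}\right)^{x_{11}} \left(\frac{\mu_{12}}{N}\right)^{x_{12}} \cdots \left(\frac{\mu_{rc}}{N}\right)^{x_{rc}}$$

since $\mu_{ij} = E(x_{ij}) = N\pi_{ij}$, or $\pi_{ij} = \mu_{ij}/N$.

Log-likelihood

$$l(\mu_{11}, \mu_{12}, \ldots, \mu_{rc}) = \ln N! - \sum_{i}\sum_{j} \ln x_{ij}! + \sum_{i}\sum_{j} x_{ij} \ln \mu_{ij} - \sum_{i}\sum_{j} x_{ij} \ln N = K + \sum_{i}\sum_{j} x_{ij} \ln \mu_{ij}$$

where

$$K = \ln N! - \sum_{i}\sum_{j} \ln x_{ij}! - N \ln N$$

With the model of independence

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)}$$

and

$$l(u, u_{1(1)}, \ldots, u_{1(r)}, u_{2(1)}, \ldots, u_{2(c)}) = K + \sum_{i}\sum_{j} x_{ij}\left(u + u_{1(i)} + u_{2(j)}\right) = K + Nu + \sum_{i} x_{i\bullet}\,u_{1(i)} + \sum_{j} x_{\bullet j}\,u_{2(j)}$$

with

$$\sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = 0$$

also

$$\sum_{i}\sum_{j} \mu_{ij} = \sum_{i}\sum_{j} e^{u + u_{1(i)} + u_{2(j)}} = e^{u} \sum_{i} e^{u_{1(i)}} \sum_{j} e^{u_{2(j)}} = N$$

Let

$$g(u, u_{1(1)}, \ldots, u_{1(r)}, u_{2(1)}, \ldots, u_{2(c)}, \lambda_1, \lambda_2, \lambda) = K + Nu + \sum_{i} x_{i\bullet}\,u_{1(i)} + \sum_{j} x_{\bullet j}\,u_{2(j)} + \lambda_1 \sum_{i} u_{1(i)} + \lambda_2 \sum_{j} u_{2(j)} + \lambda\left[e^{u} \sum_{i} e^{u_{1(i)}} \sum_{j} e^{u_{2(j)}} - N\right]$$

Now

$$\frac{\partial g}{\partial u} = N + \lambda\, e^{u} \sum_{i} e^{u_{1(i)}} \sum_{j} e^{u_{2(j)}} = N(1 + \lambda) = 0, \quad \text{so } \lambda = -1$$

$$\frac{\partial g}{\partial u_{1(i)}} = x_{i\bullet} + \lambda_1 + \lambda\, e^{u} e^{u_{1(i)}} \sum_{j} e^{u_{2(j)}} = x_{i\bullet} + \lambda_1 - N\,\frac{e^{u_{1(i)}}}{\sum_{i} e^{u_{1(i)}}} = 0$$

$$\frac{e^{u_{1(i)}}}{\sum_{i} e^{u_{1(i)}}} = \frac{x_{i\bullet} + \lambda_1}{N} = \frac{x_{i\bullet}}{N} + \frac{\lambda_1}{N}$$

Since

$$\sum_{i} \left(\frac{x_{i\bullet}}{N} + \frac{\lambda_1}{N}\right) = 1 + \frac{r\lambda_1}{N} = 1 \quad \text{and hence} \quad \lambda_1 = 0$$

Now

$$e^{u_{1(i)}} = x_{i\bullet} K_1 \quad \text{or} \quad u_{1(i)} = \ln x_{i\bullet} + \ln K_1$$

for some constant $K_1$, and

$$\sum_{i} u_{1(i)} = \sum_{i} \ln x_{i\bullet} + r \ln K_1 = 0$$

Hence

$$\ln K_1 = -\frac{1}{r} \sum_{i} \ln x_{i\bullet}$$

and

$$u_{1(i)} = \ln x_{i\bullet} - \frac{1}{r} \sum_{i} \ln x_{i\bullet}$$

Similarly

$$u_{2(j)} = \ln x_{\bullet j} - \frac{1}{c} \sum_{j} \ln x_{\bullet j}$$

Finally

$$\sum_{i}\sum_{j} \mu_{ij} = e^{u} \sum_{i} e^{u_{1(i)}} \sum_{j} e^{u_{2(j)}} = N$$

Hence

$$e^{u} = \frac{N}{\sum_{i} e^{u_{1(i)}} \sum_{j} e^{u_{2(j)}}}$$

Now

$$e^{u_{1(i)}} = \frac{x_{i\bullet}}{\left(\prod_{i=1}^{r} x_{i\bullet}\right)^{1/r}} \quad \text{and} \quad e^{u_{2(j)}} = \frac{x_{\bullet j}}{\left(\prod_{j=1}^{c} x_{\bullet j}\right)^{1/c}}$$

$$e^{u} = \frac{N \left(\prod_{i=1}^{r} x_{i\bullet}\right)^{1/r} \left(\prod_{j=1}^{c} x_{\bullet j}\right)^{1/c}}{\sum_{i} x_{i\bullet} \sum_{j} x_{\bullet j}} = \frac{1}{N} \left(\prod_{i=1}^{r} x_{i\bullet}\right)^{1/r} \left(\prod_{j=1}^{c} x_{\bullet j}\right)^{1/c}$$

Hence

$$u = \frac{1}{r} \sum_{i} \ln x_{i\bullet} + \frac{1}{c} \sum_{j} \ln x_{\bullet j} - \ln N$$

Note

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} = \frac{1}{r} \sum_{i} \ln x_{i\bullet} + \frac{1}{c} \sum_{j} \ln x_{\bullet j} - \ln N + \ln x_{i\bullet} - \frac{1}{r} \sum_{i} \ln x_{i\bullet} + \ln x_{\bullet j} - \frac{1}{c} \sum_{j} \ln x_{\bullet j}$$

$$= \ln x_{i\bullet} + \ln x_{\bullet j} - \ln N$$

or

$$\hat{\mu}_{ij} = \frac{x_{i\bullet}\, x_{\bullet j}}{N}$$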
Comments
• Maximum Likelihood estimates can be computed for any hierarchical log-linear model (i.e. with more than 2 variables).
• In certain situations the equations need to be solved numerically.
• For the saturated model (all interactions and main effects), the estimate of μijk… is xijk….
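The slides do not say which numerical method is used when the likelihood equations have no closed-form solution. One standard choice, assumed here purely for illustration, is iterative proportional fitting: start from a table of ones and rescale it in turn so that each fitted two-way margin matches the observed one. The sketch below does this for the model [12][13][23] on a three-way table, using the Framingham counts tabulated earlier; the function names are hypothetical.

```python
def to_cells(x):
    """Flatten a nested list x[i][j][k] into a dict keyed by (i, j, k)."""
    return {(i, j, k): float(v)
            for i, layer in enumerate(x)
            for j, row in enumerate(layer)
            for k, v in enumerate(row)}

def margin(cells, keep):
    """Sum the table over every axis not listed in keep."""
    totals = {}
    for cell, v in cells.items():
        key = tuple(cell[a] for a in keep)
        totals[key] = totals.get(key, 0.0) + v
    return totals

def fit_no_three_factor(x, max_iter=200, tol=1e-10):
    """Fitted values for [12][13][23]; assumes all observed two-way margins are > 0."""
    observed = to_cells(x)
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(max_iter):
        previous = dict(fitted)
        for keep in [(0, 1), (0, 2), (1, 2)]:
            target, current = margin(observed, keep), margin(fitted, keep)
            for cell in fitted:
                key = tuple(cell[a] for a in keep)
                fitted[cell] *= target[key] / current[key]
        if max(abs(fitted[c] - previous[c]) for c in fitted) < tol:
            break
    return fitted

# x[h][c][b]: heart disease (present, absent) x cholesterol x blood pressure.
x = [
    [[2, 3, 3, 4], [3, 2, 0, 3], [8, 11, 6, 6], [7, 12, 11, 11]],
    [[117, 121, 47, 22], [85, 98, 43, 20], [119, 209, 68, 43], [67, 99, 46, 33]],
]
fitted = fit_no_three_factor(x)
print(round(fitted[(0, 0, 0)], 2))
```

At convergence the fitted table reproduces all three observed two-way margins, which is the defining property of the maximum likelihood estimates for this model.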
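Log Linear Model
Two-way table

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)}$$

where

$$\sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = \sum_{i} u_{12(i,j)} = \sum_{j} u_{12(i,j)} = 0$$

The multiplicative form:

$$\mu_{ij} = e^{u + u_{1(i)} + u_{2(j)} + u_{12(i,j)}} = e^{u}\, e^{u_{1(i)}}\, e^{u_{2(j)}}\, e^{u_{12(i,j)}} = \theta\,\theta_{1(i)}\,\theta_{2(j)}\,\theta_{12(i,j)}$$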
Log-Linear model for three-way tables
Let $\mu_{ijk}$ denote the expected frequency in cell (i,j,k) of the table; then in general

$$\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$$

or the multiplicative form

$$\mu_{ijk} = e^{u}\, e^{u_{1(i)}}\, e^{u_{2(j)}}\, e^{u_{3(k)}}\, e^{u_{12(i,j)}}\, e^{u_{13(i,k)}}\, e^{u_{23(j,k)}}\, e^{u_{123(i,j,k)}}$$

$$= \theta\,\theta_{1(i)}\,\theta_{2(j)}\,\theta_{3(k)}\,\theta_{12(i,j)}\,\theta_{13(i,k)}\,\theta_{23(j,k)}\,\theta_{123(i,j,k)}$$
Comments
• The log-linear model is similar to the ANOVA models for factorial experiments.
• The ANOVA models are used to understand the effects of categorical independent variables (factors) on a continuous dependent variable (Y).
• The log-linear model is used to understand dependence amongst categorical variables.
• The presence of an interaction indicates dependence between the variables present in that interaction.
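Goodness of Fit Statistics
These statistics can be used to check whether a log-linear model fits the observed frequency table.

The Chi-squared statistic

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum_{i,j,k} \frac{(x_{ijk} - \hat{\mu}_{ijk})^2}{\hat{\mu}_{ijk}}$$

The Likelihood Ratio statistic:

$$G^2 = 2 \sum \text{Observed}\, \ln\!\left(\frac{\text{Observed}}{\text{Expected}}\right) = 2 \sum_{i,j,k} x_{ijk} \ln\!\left(\frac{x_{ijk}}{\hat{\mu}_{ijk}}\right)$$

d.f. = # cells - # parameters fitted

We reject the model if χ² or G² is greater than $\chi^2_{\alpha}$, the upper-tail critical value of the chi-square distribution with these degrees of freedom.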
Example: the three-way Framingham table (Coronary Heart Disease by Serum Cholesterol by Systolic Blood Pressure) tabulated above.
Variables
1. Systolic Blood Pressure (B)
2. Serum Cholesterol (C)
3. Coronary Heart Disease (H)

MODEL        DF   LIKELIHOOD-     PROB.    PEARSON    PROB.
                  RATIO CHISQ               CHISQ
B,C,H.       24      83.15       0.0000    102.00    0.0000
B,CH.        21      51.23       0.0002     56.89    0.0000
C,BH.        21      59.59       0.0000     60.43    0.0000
H,BC.        15      58.73       0.0000     64.78    0.0000
BC,BH.       12      35.16       0.0004     33.76    0.0007
BH,CH.       18      27.67       0.0673     26.58    0.0872   n.s.
CH,BC.       12      26.80       0.0082     33.18    0.0009
BC,BH,CH.     9       8.08       0.5265      6.56    0.6824   n.s.
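As a check on the BH,CH row of the table above, the sketch below computes both goodness-of-fit statistics for that model. Because [BH][CH] states that B and C are conditionally independent given H, its fitted values have the closed form (B-by-H total) × (C-by-H total) / (H total), so no iteration is needed. The function names are illustrative, and a zero observed count is taken to contribute zero to G².

```python
import math

# x[h][c][b]: heart disease (present, absent) x cholesterol x blood pressure.
x = [
    [[2, 3, 3, 4], [3, 2, 0, 3], [8, 11, 6, 6], [7, 12, 11, 11]],
    [[117, 121, 47, 22], [85, 98, 43, 20], [119, 209, 68, 43], [67, 99, 46, 33]],
]

def fit_bh_ch(x):
    """Fitted values for [BH][CH]: within each H level, row total * column total / total."""
    fitted = []
    for layer in x:
        chol_totals = [sum(row) for row in layer]
        bp_totals = [sum(col) for col in zip(*layer)]
        total = sum(chol_totals)
        fitted.append([[ct * bt / total for bt in bp_totals] for ct in chol_totals])
    return fitted

def pearson_and_g2(x, fitted):
    """Pearson chi-square and likelihood ratio G2 summed over all cells."""
    chi2 = g2 = 0.0
    for layer_x, layer_m in zip(x, fitted):
        for row_x, row_m in zip(layer_x, layer_m):
            for obs, exp in zip(row_x, row_m):
                chi2 += (obs - exp) ** 2 / exp
                if obs > 0:  # a zero count contributes nothing to G2
                    g2 += 2 * obs * math.log(obs / exp)
    return chi2, g2

fitted = fit_bh_ch(x)
chi2, g2 = pearson_and_g2(x, fitted)

# Parameters fitted: u, H (1), B (3), C (3), BH (3), CH (3).
df = 2 * 4 * 4 - (1 + 1 + 3 + 3 + 3 + 3)
print(f"Pearson = {chi2:.2f}, G2 = {g2:.2f}, df = {df}")
```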
Goodness of fit testing of Models
Possible Models:
1. [BH][CH]: B and C independent given H.
2. [BC][BH][CH]: the all two-factor interaction model.
Model 1: [BH][CH] Log-linear parameters

Log-Linear Model

$$\ln \mu_{ijk} = u + u_{H(i)} + u_{B(j)} + u_{C(k)} + u_{HB(i,j)} + u_{HC(i,k)}$$

$$\mu_{ijk} = e^{u}\, e^{u_{H(i)}}\, e^{u_{B(j)}}\, e^{u_{C(k)}}\, e^{u_{HB(i,j)}}\, e^{u_{HC(i,k)}} = \theta\,\theta_{H(i)}\,\theta_{B(j)}\,\theta_{C(k)}\,\theta_{HB(i,j)}\,\theta_{HC(i,k)}$$

Heart Disease - Blood Pressure Interaction, $u_{HB(i,j)}$

                Bp
Hd       <127   127-146   147-166    167+
Pres   -0.256    -0.241     0.066   0.431
Abs     0.256     0.241    -0.066  -0.431

Standardized values, $z = u_{HB(i,j)} / \sigma_{u_{HB(i,j)}}$

                Bp
Hd       <127   127-146   147-166    167+
Pres   -2.607    -2.733     0.660   4.461
Abs     2.607     2.733    -0.660  -4.461

Multiplicative effect, $\theta_{HB(i,j)} = \exp(u_{HB(i,j)}) = e^{u_{HB(i,j)}}$

                Bp
Hd       <127   127-146   147-166    167+
Pres    0.774     0.786     1.068   1.538
Abs     1.291     1.272     0.936   0.650

Heart Disease - Cholesterol Interaction, $u_{HC(i,k)}$

                Chol
Hd       <200   200-219   220-259    260+
Pres   -0.233    -0.325     0.063   0.494
Abs     0.233     0.325    -0.063  -0.494

Standardized values, $z = u_{HC(i,k)} / \sigma_{u_{HC(i,k)}}$

                Chol
Hd       <200   200-219   220-259    260+
Pres   -1.889    -2.268     0.677   5.558
Abs     1.889     2.268    -0.677  -5.558

Multiplicative effect, $\theta_{HC(i,k)} = \exp(u_{HC(i,k)}) = e^{u_{HC(i,k)}}$

                Chol
Hd       <200   200-219   220-259    260+
Pres    0.792     0.723     1.065   1.640
Abs     1.262     1.384     0.939   0.610
Model 2: [BC][BH][CH] Log-linear parameters

Blood pressure - Cholesterol interaction, $u_{BC(j,k)}$

                   Bp
Chol        <127   127-146   147-166    167+
<200       0.222    -0.019    -0.034  -0.169
200-219    0.114    -0.041     0.013  -0.086
220-259   -0.114     0.154    -0.058   0.018
260+      -0.221    -0.094     0.079   0.237

Standardized values, $z = u_{BC(j,k)} / \sigma_{u_{BC(j,k)}}$

                   Bp
Chol        <127   127-146   147-166    167+
<200       2.68     -0.236    -0.326  -1.291
200-219    1.27     -0.472     0.117  -0.626
220-259   -1.502     2.253    -0.636   0.167
260+      -2.487    -1.175     0.785   2.051

Multiplicative effect, $\theta_{BC(j,k)} = \exp(u_{BC(j,k)}) = e^{u_{BC(j,k)}}$

                   Bp
Chol        <127   127-146   147-166    167+
<200       1.248     0.981     0.967   0.844
200-219    1.120     0.960     1.013   0.918
220-259    0.892     1.166     0.944   1.018
260+       0.802     0.910     1.082   1.267
Heart Disease - Blood Pressure Interaction, $u_{HB(i,j)}$

                Bp
Hd       <127   127-146   147-166    167+
Pres   -0.211    -0.232     0.055   0.389
Abs     0.211     0.232    -0.055  -0.389

Standardized values, $z = u_{HB(i,j)} / \sigma_{u_{HB(i,j)}}$

                Bp
Hd       <127   127-146   147-166    167+
Pres   -2.125    -2.604     0.542   3.938
Abs     2.125     2.604    -0.542  -3.938

Multiplicative effect, $\theta_{HB(i,j)} = \exp(u_{HB(i,j)}) = e^{u_{HB(i,j)}}$

                Bp
Hd       <127   127-146   147-166    167+
Pres    0.809     0.793     1.056   1.475
Abs     1.235     1.261     0.947   0.678

Heart Disease - Cholesterol Interaction, $u_{HC(i,k)}$

                Chol
Hd       <200   200-219   220-259    260+
Pres   -0.212    -0.316     0.069   0.460
Abs     0.212     0.316    -0.069  -0.460

Standardized values, $z = u_{HC(i,k)} / \sigma_{u_{HC(i,k)}}$

                Chol
Hd       <200   200-219   220-259    260+
Pres   -1.712    -2.199     0.732   5.095
Abs     1.712     2.199    -0.732  -5.095

Multiplicative effect, $\theta_{HC(i,k)} = \exp(u_{HC(i,k)}) = e^{u_{HC(i,k)}}$

                Chol
Hd       <200   200-219   220-259    260+
Pres    0.809     0.729     1.071   1.584
Abs     1.237     1.372     0.933   0.631
Another Example
In this study the following were determined for N = 4353 males:
1. Occupation category
2. Educational level
3. Academic aptitude

1. Occupation categories
a. Self-employed, Business
b. Teacher\Education
c. Self-employed, Professional
d. Salaried Employed

2. Education levels
a. Low
b. Low/Med
c. High/Med
d. High

3. Academic Aptitude
a. Low
b. Low/Med
c. Med
d. High/Med
e. High
Table

Self-employed, Business
                    Education
Aptitude    Low   LMed   HMed   High   Total
Low          42     55     22      3     122
LMed         72     82     60     12     226
Med          90    106     85     25     306
HMed         27     48     47      8     130
High          8     18     19      5      50
Total       239    309    233     53     834

Teacher\Education
                    Education
Aptitude    Low   LMed   HMed   High   Total
Low           0      0      1     19      20
LMed          0      3      3     60      66
Med           1      4      5     86      96
HMed          0      0      2     36      38
High          0      0      1     14      15
Total         1      7     12    215     235

Self-employed, Professional
                    Education
Aptitude    Low   LMed   HMed   High   Total
Low           1      2      8     19      30
LMed          1      2     15     33      51
Med           2      5     25     83     115
HMed          2      2     10     45      59
High          0      0     12     19      31
Total         6     11     70    199     286

Salaried Employed
                    Education
Aptitude    Low   LMed   HMed   High   Total
Low         172    151    107     42     472
LMed        208    198    206     92     704
Med         279    271    331    191    1072
HMed         99    126    179     97     501
High         36     35     99     79     249
Total       794    781    922    501    2998
Two-way Tables (with χ²)

Education vs Aptitude (χ² = 178.6)
                    Education
Aptitude    Low   LMed   HMed   High   Total
Low         215    208    138     83     644
LMed        281    285    284    197    1047
Med         372    386    446    385    1589
HMed        128    176    238    186     728
High         44     53    131    117     345
Total      1040   1108   1237    968    4353

Education vs Occupation (χ² = 1254.1)
                      Education
Occupation    Low   LMed   HMed   High   Total
SEB           239    309    233     53     834
SEP             6     11     70    199     286
TCHR            1      7     12    215     235
SEM           794    781    922    501    2998
Total        1040   1108   1237    968    4353

Aptitude vs Occupation (χ² = 35.8)
                    Occupation
Aptitude    SEB    SEP   TCHR    SEM   Total
Low         122     30     20    472     644
LMed        226     51     66    704    1047
Med         306    115     96   1072    1589
HMed        130     59     38    501     728
High         50     31     15    249     345
Total       834    286    235   2998    4353
• It is common to handle a multiway table by testing for independence in all two-way tables.
• This is similar to looking at all the bivariate correlations.
• In this example we learn that:
1. Education is related to Aptitude
2. Education is related to Occupational category
3. Aptitude is related to Occupational category
Can we do better than this?
Fitting various log-linear models
Goodness of fit

Model                              Likelihood Ratio   DF   Sig.      Pearson      DF   Sig.
[Occ][Ed][Apt]                        1356.9702       69   0.0000    1519.802     69   0.0000
[Occ, Ed][Apt]                         228.2215       60   0.0000     226.6615    60   0.0000
[Apt, Ed][Occ]                        1179.6403       57   0.0000    1336.765     57   0.0000
[Apt, Occ][Ed]                        1319.561        57   0.0000    1424.1488    57   0.0000
[Occ, Ed][Occ, Apt]                    190.8123       48   0.0000     184.6386    48   0.0000
[Apt, Ed][Occ, Apt]                   1142.2311       45   0.0000    1301.1317    45   0.0000
[Apt, Ed][Occ, Ed]                      50.8915       48   0.3605      48.0105    48   0.4724
[Apt, Ed][Occ, Ed][Occ, Apt]            25.1048       36   0.9134      23.6465    36   0.9436

The simplest model that fits is: [Apt,Ed][Occ,Ed]
This model implies conditional independence between Aptitude and Occupation given Education.
Log-linear Parameters

Aptitude - Education Interaction
                        Education
Aptitude        Low   Low-Med   High-Med      High
Low          0.4602    0.3225    -0.2752   -0.5075
Low-Med      0.1857    0.0953    -0.0957   -0.1853
Med          0.0399   -0.0277    -0.0706    0.0584
High-Med    -0.2250   -0.0111     0.1032    0.1329
High        -0.4607   -0.3791     0.3383    0.5015

Aptitude - Education Interaction (Multiplicative)
                        Education
Aptitude        Low   Low-Med   High-Med      High
Low           1.584     1.381      0.759     0.602
Low-Med       1.204     1.100      0.909     0.831
Med           1.041     0.973      0.932     1.060
High-Med      0.799     0.989      1.109     1.142
High          0.631     0.684      1.403     1.651

Occupation - Education Interaction
                    Occupation
Education      SEB        T      SEP      SAL
Low          1.241   -1.528   -0.718    1.005
LowMed       0.800   -0.280   -0.810    0.290
HighMed     -0.050   -0.309    0.472   -0.112
High        -1.991    2.117    1.057   -1.182

Occupation - Education Interaction (Multiplicative)
                    Occupation
Education      SEB        T      SEP      SAL
Low          3.460    0.217    0.488    2.731
LowMed       2.226    0.756    0.445    1.336
HighMed      0.951    0.734    1.603    0.894
High         0.137    8.303    2.877    0.307
Conditional Test Statistics
• Suppose that we are considering two log-linear models and that Model 2 is a special case of Model 1.
• That is, the parameters of Model 2 are a subset of the parameters of Model 1.
• Also assume that Model 1 has been shown to adequately fit the data.
In this case one is interested in testing whether the differences in the expected frequencies between Model 1 and Model 2 are simply due to random variation. The likelihood ratio chi-square statistic that achieves this goal is:

$$G^2(2|1) = G^2(2) - G^2(1) = 2 \sum \text{Observed}\, \ln\!\left(\frac{\text{Expected}_1}{\text{Expected}_2}\right)$$

$$df = df_2 - df_1$$
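The conditional test is simple arithmetic on two goodness-of-fit results. As an illustration, the sketch below compares the two candidate models from the Framingham example, using the G² values and degrees of freedom reported in the model table: [BH][CH] (G² = 27.67, 18 d.f.) is the special case and [BC][BH][CH] (G² = 8.08, 9 d.f.) is the more general model. The function names are illustrative, and the p-value helper is a hand-rolled chi-square upper tail for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df > x), for x >= 0."""
    if df % 2 == 0:
        q, nu = math.exp(-x / 2), 2
    else:
        q, nu = math.erfc(math.sqrt(x / 2)), 1
    while nu < df:
        q += (x / 2) ** (nu / 2) * math.exp(-x / 2) / math.gamma(nu / 2 + 1)
        nu += 2
    return q

def conditional_test(g2_special, df_special, g2_general, df_general):
    """G2(2|1) = G2(2) - G2(1) with df = df2 - df1, where model 2 is the special case."""
    g2 = g2_special - g2_general
    df = df_special - df_general
    return g2, df, chi2_sf(g2, df)

# Special case [BH][CH] versus the more general [BC][BH][CH].
g2, df, p_value = conditional_test(27.67, 18, 8.08, 9)
print(f"G2 = {g2:.2f}, df = {df}, p = {p_value:.4f}")
```

A small p-value here would indicate that dropping the BC interaction changes the fitted frequencies by more than random variation would explain.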
Example
Table 1: Cross-Classification of a Sample of 1008 consumers according to:
(1) The Softness of the Laundry Water Used
(2) The Previous Use of Detergent Brand M
(3) The Temperature of the Laundry Water Used
(4) The Preference of Detergent Brand X over Brand M in a Consumer Blind Trial

                           Previous user of M          Previous nonuser of M
Water      Brand           High          Low           High          Low
Softness   Preference      Temperature   Temperature   Temperature   Temperature
Soft       X                   19            57            29            63
           M                   29            49            27            53
Medium     X                   23            47            33            66
           M                   47            55            23            50
Hard       X                   24            37            42            68
           M                   43            52            30            42
Goodness of Fit test for the all k-factor models

All k-factor models             d.f.    G2     p-value
[1][2][3][4]                     18    42.9    0.00083    G2(1)
[12][13][14][23][24][34]          9     9.9    0.35864    G2(2)
[123][124][134][234]              2     0.7    0.70469    G2(3)
[1234]                            0     0.0               G2(4)

Conditional tests for zero k-factor interactions

Model                           d.f.    G2     p-value
two-factor interactions           9    33.0    0.00013    G2(1|2) = G2(1) - G2(2)
three-factor interactions         7     9.2    0.23861    G2(2|3) = G2(2) - G2(3)
four-factor interaction           2     0.7    0.70469    G2(3|4) = G2(3) - G2(4)
Conclusions
1. The four-factor interaction is not significant: G2(3|4) = 0.7 (p = 0.705).
2. The all three-factor model provides an adequate fit: G2(3) = 0.7 (p = 0.705).
3. The three-factor interactions are not significantly different from 0: G2(2|3) = 9.2 (p = 0.239).
4. The all two-factor model provides an adequate fit: G2(2) = 9.9 (p = 0.359).
5. There are significant two-factor interactions: G2(1|2) = 33.0 (p = 0.00013).
Conclude that the model should contain main effects and some two-factor interactions.
There may also be a natural sequence of progressively more complicated models that one might want to identify.
In the laundry detergent example the variables are:
1. Softness of laundry water used
2. Previous use of Brand M
3. Temperature of laundry water used
4. Preference of Brand X over Brand M

A natural order for increasingly complex models which should be considered might be:
1. [1][2][3][4]: the all-main-effects model (independence amongst all four variables).
2. [1][3][24]: since previous use of Brand M may be highly related to preference for Brand M, add first the 2-4 interaction.
3. [1][34][24]: Brand M is recommended for hot water, so add second the 3-4 interaction.
4. [13][34][24]: Brand M is also recommended for soft laundry, so add third the 1-3 interaction.
5. [13][234]
6. [134][234]
For the last two models, add finally some possible 3-factor interactions.
Likelihood Ratio G2 for various models

Models                    d.f.     G2
[1][3][24]                 17     22.4
[1][24][34]                16     18.0
[13][24][34]               14     11.9
[13][23][24][34]           13     11.2
[12][13][23][24][34]       11     10.1
[1][234]                   14     14.5
[134][24]                  10     12.2
[13][234]                  12      8.4
[24][34][123]               9      8.4
[123][234]                  8      5.6

Table 2: A Partitioning of the Likelihood Ratio Chi-Square Statistic for Complete Independence
(Model (a) = [1][2][3][4], Model (b) = [1][3][24], Model (c) = [1][24][34], Model (d) = [13][24][34], Model (e) = [13][234], Model (f) = [123][234])

Model                                      d.f.     G2
Model (a)                                   18     42.9*
Difference between models (b) and (a)        1     20.5*
Model (b)                                   17     22.4
Difference between models (c) and (b)        1      4.4*
Model (c)                                   16     18.0
Difference between models (d) and (c)        2      6.1*
Model (d)                                   14     11.9
Difference between models (e) and (d)        2      3.5
Model (e)                                   12      8.4
Difference between models (f) and (e)        4      2.8
Model (f)                                    8      5.6
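The partition in Table 2 is obtained by differencing successive models in the nested sequence (a) to (f). The sketch below redoes that arithmetic from the d.f. and G² values in the table and compares each difference with the 5% chi-square critical value for its degrees of freedom (3.84, 5.99 and 9.49 for 1, 2 and 4 d.f.). Reading the asterisks in the table as significance at the 5% level is an assumption of this example.

```python
# (label, d.f., G2) for the nested sequence of models in Table 2.
models = [
    ("(a) [1][2][3][4]", 18, 42.9),
    ("(b) [1][3][24]", 17, 22.4),
    ("(c) [1][24][34]", 16, 18.0),
    ("(d) [13][24][34]", 14, 11.9),
    ("(e) [13][234]", 12, 8.4),
    ("(f) [123][234]", 8, 5.6),
]

# Upper 5% critical values of the chi-square distribution.
critical_5pct = {1: 3.84, 2: 5.99, 4: 9.49}

steps = []
for (_, df_prev, g2_prev), (label, df_next, g2_next) in zip(models, models[1:]):
    df_diff = df_prev - df_next
    g2_diff = round(g2_prev - g2_next, 1)
    steps.append((label, df_diff, g2_diff, g2_diff > critical_5pct[df_diff]))

for label, df_diff, g2_diff, significant in steps:
    flag = "*" if significant else ""
    print(f"adding terms to reach {label}: d.f. = {df_diff}, G2 = {g2_diff}{flag}")

# The differences plus the G2 of the last model add back up to the G2 of model (a).
total = round(sum(step[2] for step in steps) + models[-1][2], 1)
print(total)
```

The last two steps fall below their critical values, which is consistent with stopping at model (d) = [13][24][34].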
Comment: For any model, the parameters (u, u1(i), u2(j), u3(k)) can be estimated in addition to the expected frequencies (μijk) in each cell.
2.Model:ln ijk = u + u1(i) + u2(j) + u3(k) + u12(i,j)
i.e. u13(i,k) = u23(j,k) = u123(i,j,k) = 0.
Notation:[12][3]
Description:Independence of Variable 3 with variables 1 and 2.
3.Model:ln ijk = u + u1(i) + u2(j) + u3(k) + u13(i,k)
i.e. u12(i,j) = u23(j,k) = u123(i,j,k) = 0.
Notation: [13][2]
Description:Independence of Variable 2 with variables 1 and 3.
4.Model:ln ijk = u + u1(i) + u2(j) + u3(k) + u23(j,k)
i.e. u12(i,j) = u13(i,k) = u123(i,j,k) = 0.
Notation: [23][1]
Description:Independence of Variable 3 with variables 1 and 2.
5.Model:ln ijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k)
i.e. u23(j,k) = u123(i,j,k) = 0.
Notation:[12][13]
Description:Conditional independence between variables 2 and 3 given variable 1.
6.Model:ln ijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u23(j,k)
i.e. u13(i,k) = u123(i,j,k) = 0.
Notation:[12][23]
Description:Conditional independence between variables 1 and 3 given variable 2.
7.Model:ln ijk = u + u1(i) + u2(j) + u3(k) + u13(i,k) + u23(j,k)
i.e. u12(i,j) = u123(i,j,k) = 0.
Notation: [13][23]
Description:Conditional independence between variables 1 and 2 given variable 3.
8.Model:ln ijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k)
+ u23(j,k) i.e. u123(i,j,k) = 0.
Notation: [12][13][23]
Description:Pairwise relations among all three variables, with each two variable interaction unaffected by the value of the third variable.
9.Model: (the saturated model)ln ijk = u + u1(i) + u2(j) + u3(k) + u12(i,j) + u13(i,k)
+ u23(j,k) + u123(i,j,k)
Notation: [123]
Description:No simplifying dependence structure.
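The bracket notation and the hierarchical principle together determine how many parameters a model has. As a rough sketch (the function names `implied_terms` and `degrees_of_freedom` are illustrative, not from the slides), the code below expands each bracket into all the lower-order terms it implies, counts (levels - 1) free parameters per variable in each term, and subtracts the total from the number of cells.

```python
from itertools import combinations
from math import prod

def implied_terms(generators):
    """All u-terms implied by the bracketed generators under the hierarchical principle."""
    terms = {()}                                  # the constant term u
    for gen in generators:
        for size in range(1, len(gen) + 1):
            terms.update(combinations(sorted(gen), size))
    return terms

def degrees_of_freedom(levels, generators):
    """d.f. = number of cells - number of independent parameters fitted."""
    n_cells = prod(levels.values())
    n_params = sum(prod(levels[v] - 1 for v in term)
                   for term in implied_terms(generators))
    return n_cells - n_params

# A 2 x 2 x 2 table: variable label -> number of categories
levels = {1: 2, 2: 2, 3: 2}
three_way_models = {
    "[1][2][3]":    [(1,), (2,), (3,)],
    "[12][3]":      [(1, 2), (3,)],
    "[12][13]":     [(1, 2), (1, 3)],
    "[12][13][23]": [(1, 2), (1, 3), (2, 3)],
    "[123]":        [(1, 2, 3)],
}
df_by_model = {name: degrees_of_freedom(levels, gens)
               for name, gens in three_way_models.items()}
print(df_by_model)
```

The same function applies to tables with more variables; only the `levels` dictionary and the generator tuples change.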
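Goodness of Fit Statistics

The Chi-squared statistic:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum_{i,j,k} \frac{(x_{ijk} - \hat{\mu}_{ijk})^2}{\hat{\mu}_{ijk}}$$

The Likelihood Ratio statistic:

$$G^2 = 2 \sum \text{Observed} \, \ln\left(\frac{\text{Observed}}{\text{Expected}}\right) = 2 \sum_{i,j,k} x_{ijk} \ln\left(\frac{x_{ijk}}{\hat{\mu}_{ijk}}\right)$$

d.f. = # cells - # parameters fitted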
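We reject the model if $\chi^2$ or $G^2$ is greater than the upper-tail critical value $\chi^2_{\alpha}$ of the chi-square distribution with the degrees of freedom given above.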
Conditional Test Statistics
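In this case one is interested in testing whether the differences in the expected frequencies between Model 1 and Model 2 are simply due to random variation, where Model 2 is obtained from Model 1 by dropping parameters. The likelihood ratio chi-square statistic that achieves this goal is:

$$G^2(2|1) = G^2(2) - G^2(1) = 2 \sum \text{Observed} \, \ln\left(\frac{\text{Expected}_1}{\text{Expected}_2}\right)$$

with degrees of freedom

$$df = df_2 - df_1$$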
Stepwise selection procedures
1. Forward Selection
2. Backward Elimination

Forward Selection: Starting with a model that underfits the data, log-linear parameters that are not in the model are added step by step until a model that does fit is achieved. At each step the log-linear parameter that is most significant is added to the model. To determine the significance of a parameter added we use the statistic:

G2(2|1) = G2(2) - G2(1)
Model 1 contains the parameter.
Model 2 does not contain the parameter.

Backward Elimination: Starting with a model that overfits the data, log-linear parameters that are in the model are deleted step by step until a model is reached that continues to fit the data and has the smallest number of significant parameters. At each step the log-linear parameter that is least significant is deleted from the model. To determine the significance of a parameter deleted we use the statistic:

G2(2|1) = G2(2) - G2(1)
Model 1 contains the parameter.
Model 2 does not contain the parameter.
Example: Fitting a Log-linear model - Forward Selection

Table: Dyke-Patterson Data - N = 1729 individuals classified according to five variables: (1) Reading Newspapers, (2) Listen to radio, (3) Do "solid" reading, (4) Attend Lectures, (5) Knowledge regarding cancer

                          Radio                             No Radio
                          Solid Reading   No solid Reading  Solid Reading   No solid Reading
Newspaper    Lectures     Good   Poor     Good   Poor       Good   Poor     Good   Poor
Newspaper    Lectures     23     8        8      4          27     18       7      6
Newspaper    None         102    67       35     59         201    177      75     156
None         Lectures     1      3        4      3          3      8        2      10
None         None         16     16       13     50         67     83       84     393
                                    LIKELIHOOD-RATIO        PEARSON
MODEL                       D.F.    CHI-SQUARE   PROB       CHI-SQUARE   PROB
K,L,N,S,R.                  26      596.84       0.0000     751.31       0.0000

MODELS FORMED BY ADDING TERMS TO MODEL -- K,L,N,S,R.

                                    LIKELIHOOD-RATIO        PEARSON
MODEL                       D.F.    CHI-SQUARE   PROB       CHI-SQUARE   PROB
KL,N,S,R.                   25      579.68       0.0000     691.18       0.0000
  DIFF. DUE TO ADDING KL.   1       17.16        0.0000
KN,L,S,R.                   25      491.06       0.0000     533.89       0.0000
  DIFF. DUE TO ADDING KN.   1       105.78       0.0000
KS,L,N,R.                   25      446.39       0.0000     497.12       0.0000
  DIFF. DUE TO ADDING KS.   1       150.45       0.0000
KR,L,N,S.                   25      572.59       0.0000     674.61       0.0000
  DIFF. DUE TO ADDING KR.   1       24.25        0.0000
K,LN,S,R.                   25      575.24       0.0000     688.89       0.0000
  DIFF. DUE TO ADDING LN.   1       21.60        0.0000
K,LS,N,R.                   25      573.09       0.0000     692.25       0.0000
  DIFF. DUE TO ADDING LS.   1       23.74        0.0000
K,LR,N,S.                   25      577.89       0.0000     698.17       0.0000
  DIFF. DUE TO ADDING LR.   1       18.95        0.0000
K,L,NS,R.                   25      343.13       0.0000     383.90       0.0000
  DIFF. DUE TO ADDING NS.   1       253.71       0.0000
K,L,NR,S.                   25      522.61       0.0000     615.20       0.0000
  DIFF. DUE TO ADDING NR.   1       74.23        0.0000
K,L,N,SR.                   25      575.76       0.0000     680.88       0.0000
  DIFF. DUE TO ADDING SR.   1       21.08        0.0000

STEP 1. BEST MODEL FOUND IS -- K,L,NS,R.
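A single forward-selection step amounts to picking the candidate term with the largest drop in G2. The short sketch below replays step 1 from the likelihood-ratio values listed above; the variable names are illustrative, and the rule "largest drop in G2" is an assumption about how the program ranks candidates that all add the same d.f.

```python
# Likelihood-ratio G2 of the starting model K,L,N,S,R. (26 d.f.)
base_g2 = 596.84

# G2 after adding each two-factor term (each model has 25 d.f.)
candidate_g2 = {
    "KL": 579.68, "KN": 491.06, "KS": 446.39, "KR": 572.59, "LN": 575.24,
    "LS": 573.09, "LR": 577.89, "NS": 343.13, "NR": 522.61, "SR": 575.76,
}

# Conditional statistic G2(2|1) = G2(model without term) - G2(model with term)
drops = {term: round(base_g2 - g2, 2) for term, g2 in candidate_g2.items()}
best_term = max(drops, key=drops.get)

print(best_term, drops[best_term])
```

The selected term is NS, matching the model reported at step 1.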
K = knowledge
N = Newspaper
R = Radio
S = Reading
L = Lectures
MODEL                       D.F.    CHI-SQUARE   PROB       CHI-SQUARE   PROB
KL,NS,R.                    24      325.97       0.0000     339.14       0.0000
  DIFF. DUE TO ADDING KL.   1       17.16        0.0000
KN,L,NS,R.                  24      237.35       0.0000     258.87       0.0000
  DIFF. DUE TO ADDING KN.   1       105.78       0.0000
KS,L,NS,R.                  24      192.68       0.0000     216.12       0.0000
  DIFF. DUE TO ADDING KS.   1       150.45       0.0000
KR,L,NS.                    24      318.88       0.0000     329.40       0.0000
  DIFF. DUE TO ADDING KR.   1       24.25        0.0000
K,LN,NS,R.                  24      321.53       0.0000     341.35       0.0000
  DIFF. DUE TO ADDING LN.   1       21.60        0.0000
K,LS,NS,R.                  24      319.39       0.0000     348.68       0.0000
  DIFF. DUE TO ADDING LS.   1       23.75        0.0000
K,LR,NS.                    24      324.18       0.0000     341.62       0.0000
  DIFF. DUE TO ADDING LR.   1       18.95        0.0000
K,L,NR,NS.                  24      268.90       0.0000     280.86       0.0000
  DIFF. DUE TO ADDING NR.   1       74.23        0.0000
K,L,SR,NS.                  24      322.05       0.0000     347.33       0.0000
  DIFF. DUE TO ADDING SR.   1       21.08        0.0000

STEP 2. BEST MODEL FOUND IS -- KS,L,NS,R.
MODEL                       D.F.    CHI-SQUARE   PROB       CHI-SQUARE   PROB
KL,KS,NS,R.                 23      175.52       0.0000     182.86       0.0000
  DIFF. DUE TO ADDING KL.   1       17.16        0.0000
KN,KS,L,NS,R.               23      152.96       0.0000     163.87       0.0000
  DIFF. DUE TO ADDING KN.   1       39.72        0.0000
KR,KS,L,NS.                 23      168.43       0.0000     173.32       0.0000
  DIFF. DUE TO ADDING KR.   1       24.25        0.0000
KS,LN,NS,R.                 23      171.08       0.0000     184.56       0.0000
  DIFF. DUE TO ADDING LN.   1       21.60        0.0000
LS,KS,NS,R.                 23      168.93       0.0000     202.28       0.0000
  DIFF. DUE TO ADDING LS.   1       23.74        0.0000
KS,LR,NS.                   23      173.73       0.0000     178.08       0.0000
  DIFF. DUE TO ADDING LR.   1       18.95        0.0000
KS,L,NR,NS.                 23      118.45       0.0000     128.83       0.0000
  DIFF. DUE TO ADDING NR.   1       74.23        0.0000
SR,KS,L,NS.                 23      171.60       0.0000     198.23       0.0000
  DIFF. DUE TO ADDING SR.   1       21.08        0.0000

STEP 3. BEST MODEL FOUND IS -- KS,L,NR,NS.
MODEL                               D.F.    CHI-SQUARE   PROB       CHI-SQUARE   PROB
LN,KL,SR,KR,KN,LR,LS,KS,NR,NS.      16      19.56        0.2406     21.21        0.1706
  DIFF. DUE TO ADDING SR.           1       0.42         0.5147
KLN,KR,LR,LS,KS,NR,NS.              16      18.86        0.2762     21.53        0.1589
  DIFF. DUE TO ADDING KLN.          1       1.13         0.2878
LN,KLS,KR,KN,LR,NR,NS.              16      15.99        0.4538     15.63        0.4794
  DIFF. DUE TO ADDING KLS.          1       4.00         0.0456
LN,KLR,KN,LS,KS,NR,NS.              16      19.28        0.2543     20.81        0.1860
  DIFF. DUE TO ADDING KLR.          1       0.70         0.4015
LN,KL,KR,KNS,LR,LS,NR.              16      16.78        0.4000     18.74        0.2821
  DIFF. DUE TO ADDING KNS.          1       3.21         0.0733
LN,KL,KNR,LR,LS,KS,NS.              16      19.90        0.2247     21.27        0.1682
  DIFF. DUE TO ADDING KNR.          1       0.09         0.7704
LNS,KL,KR,KN,LR,KS,NR.              16      19.58        0.2397     20.98        0.1794
  DIFF. DUE TO ADDING LNS.          1       0.41         0.5239
LNR,KL,KR,KN,LS,KS,NS.              16      18.11        0.3176     18.80        0.2790
  DIFF. DUE TO ADDING LNR.          1       1.88         0.1706

STEP 10. BEST MODEL FOUND IS -- LN,KLS,KR,KN,LR,NR,NS.
Continuing after 10 steps
MODEL                               D.F.    CHI-SQUARE   PROB       CHI-SQUARE   PROB
LN,SR,KLS,KR,KN,LR,NR,NS.           15      15.55        0.4127     15.15        0.4406
  DIFF. DUE TO ADDING SR.           1       0.44         0.5072
KLN,KLS,KR,LR,NR,NS.                15      12.98        0.6041     13.84        0.5379
  DIFF. DUE TO ADDING KLN.          1       3.01         0.0827
LN,KLR,KLS,KN,NR,NS.                15      15.10        0.4446     15.06        0.4471
  DIFF. DUE TO ADDING KLR.          1       0.89         0.3446
LN,KNS,KLS,KR,LR,NR.                15      13.21        0.5861     13.19        0.5878
  DIFF. DUE TO ADDING KNS.          1       2.78         0.0955
LN,KLS,KNR,LR,NS.                   15      15.93        0.3870     15.48        0.4173
  DIFF. DUE TO ADDING KNR.          1       0.06         0.8034
LNS,KLS,KR,KN,LR,NR.                15      15.87        0.3905     15.60        0.4089
  DIFF. DUE TO ADDING LNS.          1       0.12         0.7343
LNR,KLS,KR,KN,NS.                   15      14.23        0.5085     13.75        0.5446
  DIFF. DUE TO ADDING LNR.          1       1.76         0.1842

STEP 11. BEST MODEL FOUND IS -- KLN,KLS,KR,LR,NR,NS.
The final step
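None of the terms added at step 11 is significant at the 5% level (the smallest probability is 0.0827, for KLN), so the best model was found at the previous step:
[LN][KLS][KR][KN][LR][NR][NS]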
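The stopping decision can be sketched as follows: convert each 1 d.f. difference at step 11 to an upper-tail chi-square probability and stop when none falls below the chosen level. The 5% level and the helper name `p_value_1df` are assumptions made for illustration; for 1 d.f. the chi-square tail probability equals erfc(sqrt(x/2)).

```python
import math

def p_value_1df(g2_diff):
    """Upper-tail chi-square probability with 1 d.f."""
    return math.erfc(math.sqrt(g2_diff / 2.0))

# Differences in G2 from adding each term at step 11 (1 d.f. each)
step11_diffs = {"SR": 0.44, "KLN": 3.01, "KLR": 0.89, "KNS": 2.78,
                "KNR": 0.06, "LNS": 0.12, "LNR": 1.76}

alpha = 0.05  # assumed significance level
p_values = {term: p_value_1df(d) for term, d in step11_diffs.items()}
stop = all(p > alpha for p in p_values.values())

for term, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"{term}: p = {p:.4f}")
print("stop forward selection:", stop)
```

By contrast, the term KLS added at step 10 had a difference of 4.00, which the same helper places just under the 5% level.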
Modelling of response variables
Independent → Dependent
Logit Models
To date we have not worried whether any of the variables were dependent or independent variables. The logit model is used when we have a single binary dependent variable.
Example: Logit Models

Table: The Effect of planting depth on mortality of Pine seedlings

                     Longleaf Seedlings        Slash Seedlings
Depth of Planting    Dead   Alive   Totals     Dead   Alive   Totals
Too High             41     59      100        12     88      100
Too Low              11     89      100        5      95      100
Totals               52     148     200        17     183     200

Table: Loglinear Models Fit to Data in Above Table and their Goodness of Fit Statistics

Model           χ2      G2      df
[12][13][23]    1.37    1.28    1
[13][23]        26.54   27.79   2
[12][13]        24.03   25.03   2
[13][2]         54.70   50.10   3
The variables
1. Type of seedling (T)
   a. Longleaf seedling
   b. Slash seedling
2. Depth of planting (D)
   a. Too low
   b. Too high
3. Mortality (M) (the dependent variable)
   a. Dead
   b. Alive
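The Log-linear Model

$$\ln \mu_{ijk} = u + u_{T(i)} + u_{D(j)} + u_{M(k)} + u_{TD(i,j)} + u_{TM(i,k)} + u_{DM(j,k)} + u_{TDM(i,j,k)}$$

Note:
$\mu_{ij1}$ = # dead when T = i and D = j.
$\mu_{ij2}$ = # alive when T = i and D = j.

$$\frac{\mu_{ij1}}{\mu_{ij2}} = \frac{\text{dead}}{\text{alive}} = \text{mortality ratio when T = i and D = j.}$$

Hence the log-mortality ratio is

$$\ln\left(\frac{\mu_{ij1}}{\mu_{ij2}}\right) = \ln \mu_{ij1} - \ln \mu_{ij2}$$
$$= \left[u + u_{T(i)} + u_{D(j)} + u_{M(1)} + u_{TD(i,j)} + u_{TM(i,1)} + u_{DM(j,1)} + u_{TDM(i,j,1)}\right]$$
$$- \left[u + u_{T(i)} + u_{D(j)} + u_{M(2)} + u_{TD(i,j)} + u_{TM(i,2)} + u_{DM(j,2)} + u_{TDM(i,j,2)}\right]$$
$$= 2u_{M(1)} + 2u_{TM(i,1)} + 2u_{DM(j,1)} + 2u_{TDM(i,j,1)}$$

since

$$u_{M(2)} = -u_{M(1)}, \quad u_{TM(i,2)} = -u_{TM(i,1)}, \quad u_{DM(j,2)} = -u_{DM(j,1)}, \quad u_{TDM(i,j,2)} = -u_{TDM(i,j,1)}$$

The logit model:

$$\ln\left(\frac{\mu_{ij1}}{\mu_{ij2}}\right) = \ln \mu_{ij1} - \ln \mu_{ij2} = v + v_{T(i)} + v_{D(j)} + v_{TD(i,j)}$$

where

$$v = 2u_{M(1)}, \quad v_{T(i)} = 2u_{TM(i,1)}, \quad v_{D(j)} = 2u_{DM(j,1)}, \quad \text{and} \quad v_{TD(i,j)} = 2u_{TDM(i,j,1)}$$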
Thus corresponding to a loglinear model there is a logit model predicting the log ratio of expected frequencies of the two categories of the dependent variable.

Also, (k + 1)-factor interactions with the dependent variable in the loglinear model determine k-factor interactions in the logit model:
k + 1 = 1: constant term in the logit model
k + 1 = 2: main effects in the logit model
1 = Depth, 2 = Mort, 3 = Type
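With this numbering, model [13][2] says mortality is independent of the depth-by-type combination, so its expected counts have a closed form: (depth-by-type total) x (mortality total) / n. The sketch below recomputes the Pearson and likelihood-ratio statistics for that model from the seedling counts; the variable names are illustrative.

```python
import math

# Observed counts keyed by (depth, type, mortality)
observed = {
    ("high", "longleaf", "dead"): 41, ("high", "longleaf", "alive"): 59,
    ("low", "longleaf", "dead"): 11,  ("low", "longleaf", "alive"): 89,
    ("high", "slash", "dead"): 12,    ("high", "slash", "alive"): 88,
    ("low", "slash", "dead"): 5,      ("low", "slash", "alive"): 95,
}
n = sum(observed.values())

# Sufficient margins for [13][2]: depth x type totals and mortality totals
depth_type_total, mort_total = {}, {}
for (depth, kind, mort), count in observed.items():
    depth_type_total[(depth, kind)] = depth_type_total.get((depth, kind), 0) + count
    mort_total[mort] = mort_total.get(mort, 0) + count

expected = {cell: depth_type_total[cell[:2]] * mort_total[cell[2]] / n
            for cell in observed}

chi2 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
g2 = 2 * sum(observed[c] * math.log(observed[c] / expected[c]) for c in observed)
df = len(observed) - (1 + 1 + 1 + 1 + 1)  # u, depth, type, mort, depth x type

print(f"chi2 = {chi2:.2f}, G2 = {g2:.2f}, df = {df}")
```

The values agree with the [13][2] row of the goodness-of-fit table (54.70, 50.10, 3 d.f.).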
Log-Linear parameters for Model: [TM][TD][DM]

Main Effects:
Mort     Dead     Alive
         -0.946   0.946
Type     Lleaf    Slash
         0.240    -0.240
Depth    low      high
         -0.257   0.257

Two-Factor Interactions:

Type-Mort
Type     Dead     Alive
Lleaf    0.354    -0.354
Slash    -0.354   0.354

Depth-Mort
Depth    Dead     Alive
low      -0.376   0.376
high     0.376    -0.376

Depth-Type
Depth    Lleaf    Slash
low      0.063    -0.063
high     -0.063   0.063
Logit Model for predicting the Mortality

$$\ln(MR) = v + v_{D(i)} + v_{T(k)}$$

or

$$MR = \frac{\text{dead}}{\text{alive}} = e^{v} \, e^{v_{D(i)}} \, e^{v_{T(k)}}$$

                 Log-Linear   Logit    Mult
const            -0.946       -1.892   0.151
Depth - High     0.376        0.752    2.121
Depth - Low      -0.376       -0.752   0.471
Type - Long      0.354        0.708    2.030
Type - Slash     -0.354       -0.708   0.493
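As a cross-check on the logit parameters, the no-three-factor-interaction model can be fitted directly to the seedling counts and the logit effects read off the fitted values. The sketch below uses iterative proportional fitting, which is one standard way to fit hierarchical loglinear models; the slides do not say which algorithm produced their output, so this choice, the function names, and the fixed number of cycles are all assumptions.

```python
import math

# Observed counts keyed by (depth, type, mortality)
obs = {
    ("high", "longleaf", "dead"): 41, ("high", "longleaf", "alive"): 59,
    ("low", "longleaf", "dead"): 11,  ("low", "longleaf", "alive"): 89,
    ("high", "slash", "dead"): 12,    ("high", "slash", "alive"): 88,
    ("low", "slash", "dead"): 5,      ("low", "slash", "alive"): 95,
}

def margin(table, axes):
    """Sum a table over all positions not listed in axes."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out

def ipf(observed, margins, cycles=200):
    """Iterative proportional fitting to the listed two-way margins."""
    fit = {cell: 1.0 for cell in observed}
    for _ in range(cycles):
        for axes in margins:
            target, current = margin(observed, axes), margin(fit, axes)
            for cell in fit:
                key = tuple(cell[a] for a in axes)
                fit[cell] *= target[key] / current[key]
    return fit

# All three two-factor margins: depth x type, depth x mort, type x mort
fit = ipf(obs, margins=[(0, 1), (0, 2), (1, 2)])
g2 = 2 * sum(n * math.log(n / fit[cell]) for cell, n in obs.items())

# Fitted log-mortality ratios and their decomposition into logit effects
logit = {(d, t): math.log(fit[(d, t, "dead")] / fit[(d, t, "alive")])
         for d in ("high", "low") for t in ("longleaf", "slash")}
v = sum(logit.values()) / 4
v_depth_high = (logit[("high", "longleaf")] + logit[("high", "slash")]) / 2 - v
v_type_longleaf = (logit[("high", "longleaf")] + logit[("low", "longleaf")]) / 2 - v

print(f"G2 = {g2:.2f}")
print(f"constant = {v:.3f} (mult {math.exp(v):.3f})")
print(f"depth too high = {v_depth_high:.3f} (mult {math.exp(v_depth_high):.3f})")
print(f"type longleaf = {v_type_longleaf:.3f} (mult {math.exp(v_type_longleaf):.3f})")
```

Each logit effect is twice the corresponding loglinear interaction with mortality, and its exponential gives the multiplicative effect on the dead/alive ratio.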
For the Dyke-Patterson data, the best model found by forward selection was
[LN][KLS][KR][KN][LR][NR][NS]

To fit a logit model to predict K (Knowledge) we need to fit a loglinear model containing the important interactions with K (Knowledge), namely

[LNRS][KLS][KR][KN]

The logit model will contain:
Main effects for L (Lectures), N (Newspapers), R (Radio), and S (Reading)
A two-factor interaction effect for L and S
The Logit Parameters for the Model: LNSR, KLS, KR, KN
(Multiplicative effects are given in brackets; Logit Parameters = 2 × Loglinear parameters)

The Constant term:
-0.226 (0.798)

The Main effects on Knowledge:
Lectures     Lect     0.268 (1.307)
             None     -0.268 (0.765)
Newspaper    News     0.324 (1.383)
             None     -0.324 (0.723)
Reading      Solid    0.340 (1.405)
             Not      -0.340 (0.712)
Radio        Radio    0.150 (1.162)
             None     -0.150 (0.861)

The Two-factor interaction Effect of Reading and Lectures on Knowledge

             Reading
Lectures     Solid             Not
Lect         -0.180 (0.835)    0.180 (1.197)
None         0.180 (1.197)     -0.180 (0.835)

The ratio being modelled is K = good / poor.
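To illustrate how these parameters combine, the sketch below adds the logit effects for a given profile and exponentiates the sum to obtain the predicted good/poor knowledge ratio. The function name `knowledge_odds` and the dictionary layout are illustrative only.

```python
import math

constant = -0.226
main_effects = {
    "lectures":  {"Lect": 0.268, "None": -0.268},
    "newspaper": {"News": 0.324, "None": -0.324},
    "reading":   {"Solid": 0.340, "Not": -0.340},
    "radio":     {"Radio": 0.150, "None": -0.150},
}
# Reading x Lectures interaction, keyed by (lectures, reading)
lect_by_reading = {
    ("Lect", "Solid"): -0.180, ("Lect", "Not"): 0.180,
    ("None", "Solid"): 0.180,  ("None", "Not"): -0.180,
}

def knowledge_odds(lectures, newspaper, reading, radio):
    """Predicted ratio good/poor knowledge for one combination of predictors."""
    logit = (constant
             + main_effects["lectures"][lectures]
             + main_effects["newspaper"][newspaper]
             + main_effects["reading"][reading]
             + main_effects["radio"][radio]
             + lect_by_reading[(lectures, reading)])
    return math.exp(logit)

everything = knowledge_odds("Lect", "News", "Solid", "Radio")
nothing = knowledge_odds("None", "None", "Not", "None")
print(f"all four sources: good/poor = {everything:.3f}")
print(f"no sources:       good/poor = {nothing:.3f}")
```

For comparison, the observed counts for people using none of the four sources are 84 good and 393 poor, a ratio of about 0.21.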
Fitting a Logit Model with a Polytomous Response Variable
Example:

Table: Observed Cross-Classification of 2294 Males Who Failed to Pass the Armed Forces Qualification Test

                   Father's      Respondent's Education
Race     Age       Education     Grammar School   Some HS   HS Graduate
White    < 22      GS            39               29        8
                   Some HS       4                8         1
                   HS Grad       11               9         6
                   NA            48               17        8
White    ≥ 22      GS            231              115       51
                   Some HS       17               21        13
                   HS Grad       18               28        45
                   NA            197              111       35
Black    < 22      GS            19               40        19
                   Some HS       5                17        7
                   HS Grad       2                14        3
                   NA            49               79        24
Black    ≥ 22      GS            110              133       103
                   Some HS       18               38        25
                   HS Grad       11               25        18
                   NA            178              206       81

NA = Not available
The variables
1. Race: white, black
2. Age: < 22, ≥ 22
3. Father's education: GS, some HS, HS grad, NA
4. Respondent's education: GS, some HS, HS grad (the response (dependent) variable)
Table: Various Loglinear Models Fit to the 3 × 4 × 2 × 2 Table above

Model                d.f.   G2      p-value
[234][1]             30     254.8   0.0000
[234][12]            24     162.6   0.0000
[234][13]            28     242.7   0.0000
[234][14]            28     152.8   0.0000
[234][12][13]        22     151.5   0.0000
[234][12][14]        22     46.7    0.0016
[234][13][14]        26     142.5   0.0000
[234][12][13][14]    20     36.9    0.0120
[234][123][14]       14     27.9    0.0147
[234][124][13]       14     18.1    0.2023
[234][134][12]       18     33.2    0.0158
[234][123][124]      8      9.7     0.2867
Techniques for handling a Polytomous Response Variable

Approaches
1. Consider the categories 2 at a time. Do this for all possible pairs of the categories.
2. Look at the continuation ratios
   i. 1 vs 2
   ii. 1,2 vs 3
   iii. 1,2,3 vs 4
   iv. etc.
Table: Estimated Logit Effects for the Three Logit Models Corresponding to the Log Linear Model [234][124][13]

                                   Grammar vs Some HS   Grammar vs HS Grad   Some HS vs HS Grad
                                   log(m1jkl/m2jkl)     log(m1jkl/m3jkl)     log(m2jkl/m3jkl)
Constant                           -0.289               0.451                0.740
Race                 White         0.395                0.390                -0.005
                     Black         -0.395               -0.390               0.005
Age                  < 22          -0.120               0.099                0.219
                     ≥ 22          0.120                -0.099               -0.219
Father's Education   Grammar       0.380                0.406                0.026
                     Some HS       -0.371               -0.355               0.016
                     HS Grad       -0.441               -0.918               -0.477
                     NA            0.432                0.867                0.435

Race - Father's Education Interaction
White by             Grammar       0.063                0.345                0.282
                     Some HS       -0.128               -0.016               0.112
                     HS Grad       0.030                -0.429               -0.459
                     NA            0.035                0.101                0.066
Black by             Grammar       -0.063               -0.345               -0.282
                     Some HS       0.128                0.016                -0.112
                     HS Grad       -0.030               0.429                0.459
                     NA            -0.035               -0.101               -0.066
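The three pairwise logits are not independent: log(m2/m3) = log(m1/m3) - log(m1/m2), so each entry in the third column should equal the second column minus the first. The sketch below checks this for one level of each effect in the table; the dictionary is just a transcription of those rows.

```python
# (Grammar vs Some HS, Grammar vs HS Grad, Some HS vs HS Grad)
effects = {
    "Constant":              (-0.289, 0.451, 0.740),
    "Race: White":           (0.395, 0.390, -0.005),
    "Age: < 22":             (-0.120, 0.099, 0.219),
    "Father: Grammar":       (0.380, 0.406, 0.026),
    "Father: Some HS":       (-0.371, -0.355, 0.016),
    "Father: HS Grad":       (-0.441, -0.918, -0.477),
    "Father: NA":            (0.432, 0.867, 0.435),
    "White by Grammar":      (0.063, 0.345, 0.282),
    "White by Some HS":      (-0.128, -0.016, 0.112),
    "White by HS Grad":      (0.030, -0.429, -0.459),
    "White by NA":           (0.035, 0.101, 0.066),
}

# Gap between the tabulated third column and (second - first)
gaps = {name: abs((g_vs_hs - g_vs_some) - some_vs_hs)
        for name, (g_vs_some, g_vs_hs, some_vs_hs) in effects.items()}
largest_gap = max(gaps.values())
print(f"largest discrepancy: {largest_gap:.4f}")
```

Only two of the three pairwise logit models therefore carry independent information.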
Table: Multiplicative Logit Effects for the Three Logit Models Corresponding to the Log Linear Model [234][124][13]

                                   Grammar vs Some HS   Grammar vs HS Grad   Some HS vs HS Grad
                                   m1jkl/m2jkl          m1jkl/m3jkl          m2jkl/m3jkl
Constant                           0.749                1.570                2.096
Race                 White         1.484                1.477                0.995
                     Black         0.674                0.677                1.005
Age                  < 22          0.887                1.104                1.245
                     ≥ 22          1.127                0.906                0.803
Father's Education   Grammar       1.462                1.501                1.026
                     Some HS       0.690                0.701                1.016
                     HS Grad       0.643                0.399                0.621
                     NA            1.540                2.380                1.545

Race - Father's Education Interaction
White by             Grammar       1.065                1.412                1.326
                     Some HS       0.880                0.984                1.119
                     HS Grad       1.030                0.651                0.632
                     NA            1.036                1.106                1.068
Black by             Grammar       0.939                0.708                0.754
                     Some HS       1.137                1.016                0.894
                     HS Grad       0.970                1.536                1.582
                     NA            0.966                0.904                0.936
Table: Various Logit Models for the Log Continuation Ratios in the first Table

(a) log(m2jk / m1jk)
(b) log(m3jk / (m1jk + m2jk))

                     (a)             (b)             Combined Fit
Model                d.f.   G2       d.f.   G2       d.f.   G2
[234][1]             15     131.5    15     123.3    30     254.8
[234][12]            12     97.9     12     64.7     24     162.6
[234][13]            14     123.3    14     119.4    28     242.7
[234][14]            14     49.0     14     102.8    28     152.8
[234][12][13]        11     91.9     11     60.3     22     152.2
[234][12][14]        11     16.1     11     35.6     22     51.7
[234][13][14]        13     43.7     13     98.7     26     142.4
[234][12][13][14]    10     12.4     10     29.8     20     42.2
[234][123][14]       7      9.3      7      23.2     14     32.5
[234][124][13]       7      9.3      7      23.2     14     18.5
[234][134][12]       9      8.6      9      29.7     18     38.3
[234][123][124]      4      8.5      4      1.2      8      9.7
Causal or Path Analysis for Categorical Data
When the data is continuous, a causal pattern may be assumed to exist amongst the variables.

The path diagram
This is a diagram summarizing causal relationships.
A straight arrow is drawn from a variable to another variable on which it has a causal effect: X → Y
A curved double-headed arrow is drawn between variables that are simply correlated: X ↔ Y
Example 1
The variables: Job stress, Smoking, Heart Disease
The path diagram
[Path diagram with nodes: Job Stress, Smoking, Heart Disease]

In Path Analysis for continuous variables, one is interested in determining the contribution along each path (the path coefficients).
Example 2
The variables: Job stress, Alcoholic Drinking, Smoking, Heart Disease
The path diagram
[Path diagram with nodes: Job Stress, Drinking, Smoking, Heart Disease]

In the analysis of categorical data there are no path coefficients, but path diagrams can point to the appropriate logit analysis.
Example
In this example the data consist of two-wave, two-variable panel data for a sample of n = 3398 schoolboys. The study looks at "membership in" and "attitude towards" the leading crowd.

The path diagram:
[Path diagram with nodes: A, B, C, D]
This suggests predicting B from A, then C from A and B, and finally D from A, B and C.
Examples of Causal Analysis Using Recursive Systems of Logit Models

Example 1: Two-Wave Two-Variable Panel Data for 3398 Schoolboys: Membership in and attitude toward the "Leading Crowd".

                              Second Interview
                              Membership   +     +     -     -
                              Attitude     +     -     +     -
First Interview
Membership   Attitude
+            +                             458   140   110   49
+            -                             171   182   56    87
-            +                             184   75    531   281
-            -                             85    97    338   554

A = Membership at first interview, B = Attitude at first interview, C = Membership at second interview, D = Attitude at second interview
Two-way Analysis for determining the effect of A on B

                     Attitude (B)
Membership (A)       +       -
+                    757     496
-                    1071    1074
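One simple way to summarize the effect of A on B in this 2 × 2 table is the odds ratio, together with the likelihood-ratio statistic for independence. The sketch below is only an illustration of that calculation; the slides do not show which summary was reported for this step.

```python
import math

# Counts keyed by (membership A, attitude B) at the first interview
table = {("+", "+"): 757, ("+", "-"): 496, ("-", "+"): 1071, ("-", "-"): 1074}
n = sum(table.values())

# Odds of a favourable attitude for members relative to non-members
odds_ratio = (table[("+", "+")] * table[("-", "-")]) / (table[("+", "-")] * table[("-", "+")])

# Expected counts under independence of A and B
row = {a: sum(v for (ra, _), v in table.items() if ra == a) for a in "+-"}
col = {b: sum(v for (_, cb), v in table.items() if cb == b) for b in "+-"}
expected = {(a, b): row[a] * col[b] / n for a in "+-" for b in "+-"}

g2 = 2 * sum(table[c] * math.log(table[c] / expected[c]) for c in table)
print(f"n = {n}, odds ratio = {odds_ratio:.3f}, G2 (1 d.f.) = {g2:.1f}")
```

A G2 this far above the 1 d.f. critical value of 3.84 indicates that membership and attitude at the first interview are associated.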
Goodness of Fit Statistics for determining the effect of A, B on C
1. [AB][AC][BC] (1 df; G2 = 0.0)
2. [AB][BC] (2 df; G2 = 1005.1)
3. [AB][AC] (2 df; G2 = 27.2)

Identified Logit Model (Model # 1. [AB][AC][BC])

$$\text{logit}^{AB|C}_{ij} = \log\left(\frac{m^{AB|C}_{ij1}}{m^{AB|C}_{ij2}}\right) = w^{AB|C} + w^{AB|C}_{1(i)} + w^{AB|C}_{2(j)}$$
Goodness of Fit Statistics for determining the effect of A, B, C on D
4. [ABC][AD][BD][CD] (4 df; G2 = 1.2)
5. [ABC][BD][CD] (5 df; G2 = 4.0)
6. [ABC][AD][CD] (5 df; G2 = 262.5)
7. [ABC][AD][BD] (5 df; G2 = 15.7)

Identified Logit Model (Model # 5. [ABC][BD][CD])

$$\text{logit}^{ABC|D}_{ijk} = w^{ABC|D} + w^{ABC|D}_{2(j)} + w^{ABC|D}_{3(k)}$$
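The choice of Model 5 can be reproduced with conditional tests against Model 4: dropping a term is acceptable only if the resulting increase in G2 is not significant. The sketch below assumes a 5% level and uses the 1 d.f. chi-square tail probability erfc(sqrt(x/2)); the names are illustrative.

```python
import math

full_df, full_g2 = 4, 1.2          # Model 4: [ABC][AD][BD][CD]
reduced = {                        # term dropped -> (df, G2) of the reduced model
    "AD": (5, 4.0),                # Model 5: [ABC][BD][CD]
    "BD": (5, 262.5),              # Model 6: [ABC][AD][CD]
    "CD": (5, 15.7),               # Model 7: [ABC][AD][BD]
}

alpha = 0.05  # assumed significance level
results = {}
for term, (df, g2) in reduced.items():
    diff_df, diff_g2 = df - full_df, g2 - full_g2
    assert diff_df == 1            # the erfc formula below is for 1 d.f. only
    results[term] = (diff_g2, math.erfc(math.sqrt(diff_g2 / 2.0)))

droppable = [term for term, (_, p) in results.items() if p > alpha]
for term, (diff_g2, p) in results.items():
    print(f"drop {term}: G2 increase = {diff_g2:.1f}, p = {p:.4f}")
print("terms that can be dropped:", droppable)
```

Only the AD term can be removed, which leaves B and C as the predictors of D in the identified logit model.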
Example 2
In this example we are looking at
1. Social Economic Status (SES)
2. Sex
3. IQ
4. Parental Encouragement for Higher Education (PE)
5. College Plans (CP)

Table: Social Class, Parental Encouragement, IQ, and Educational Aspirations

               College   Parental         SES
Sex    IQ      Plans     Encouragement    L      LM     UM     H
M      L       Yes       Low              4      2      8      4
                         High             13     27     47     39
               No        Low              349    232    166    48
                         High             64     84     91     57
       LM      Yes       Low              9      7      6      5
                         High             33     64     74     123
               No        Low              207    201    120    47
                         High             72     95     110    90
       UM      Yes       Low              12     12     17     9
                         High             38     93     148    224
               No        Low              126    115    92     41
                         High             54     92     100    65
       H       Yes       Low              10     17     6      8
                         High             49     119    198    414
               No        Low              67     79     42     17
                         High             43     59     73     54
F      L       Yes       Low              5      11     7      6
                         High             9      29     36     36
               No        Low              454    285    163    50
                         High             44     61     72     58
       LM      Yes       Low              5      19     13     5
                         High             14     47     75     110
               No        Low              312    236    193    70
                         High             47     88     90     76
       UM      Yes       Low              8      12     12     12
                         High             20     62     91     230
               No        Low              216    164    174    48
                         High             35     85     100    81
       H       Yes       Low              13     15     20     13
                         High             28     72     142    360
               No        Low              96     113    81     49
                         High             24     50     77     98
The Path Diagram
[Path diagram with nodes: SES, Sex, IQ, PE, CP]
The path diagram suggests
1. Predicting Parental Encouragement from Sex, SocioEconomic status, and IQ, then
2. Predicting College Plans from Parental Encouragement, Sex, SocioEconomic status, and IQ.
Goodness of Fit Statistics for determining the effect of A, B, C on D
(A = Social class, B = IQ, C = Sex, D = Parental Encouragement, E = College Plans)
1. [ABC][AD][BD][CD] (24 df; G2 = 55.81)
2. [ABC][ABD][CD] (15 df; G2 = 34.60)
3. [ABC][BCD][ACD] (18 df; G2 = 31.48)
4. [ABC][ABD][BCD] (12 df; G2 = 22.44)
5. [ABC][ABD][ACD] (12 df; G2 = 22.45)
6. [ABC][ABD][ACD][BCD] (9 df; G2 = 9.22)
Logit Parameters: Model [ABC][ABD][ACD][BCD]

Constant term
$w^{ABC|D}$ = 0.124

Main Effects
Social Class                    L        LM       UM       H
$w^{ABC|D}_{1(i)}$              -1.178   -0.384   0.222    1.340
IQ                              L        LM       UM       H
$w^{ABC|D}_{2(j)}$              -0.772   -0.226   0.210    0.788
Sex                             M        F
$w^{ABC|D}_{3(k)}$              0.304    -0.304

Two-factor Interactions

IQ by Social Class
                  IQ
Social Class      L        LM       UM       H
L                 -0.016   -0.098   -0.058   -0.026
LM                0.066    0.032    0.144    -0.244
UM                0.074    -0.044   -0.138   0.108
H                 -0.126   -0.086   0.048    0.164

Social Class by Sex
                  Sex
Social Class      M        F
L                 0.140    -0.140
LM                -0.052   0.052
UM                0.018    -0.018
H                 -0.106   0.106

IQ by Sex
                  Sex
IQ                M        F
L                 -0.126   0.126
LM                -0.016   0.016
UM                0.018    -0.018
H                 0.122    -0.122
Goodness of Fit Statistics for determining the effect of A, B, C, D on E
(A = Social class, B = IQ, C = Sex, D = Parental Encouragement, E = College Plans)
7. [ABCD][E][CD] (63 df; G2 = 4497.51)
8. [ABCD][AE][BE][CE][DE] (55 df; G2 = 73.82)
9. [ABCD][BCE][AE][DE] (52 df; G2 = 59.55)
Logit Parameters for Predicting College Plans Using Model 9: [ABCD][BCE][AE][DE]

Constant term
$w^{ABCD|E}$ = -1.292

Main Effects
Social Class                    L        LM       UM       H
$w^{ABCD|E}_{1(i)}$             -0.650   -0.200   0.062    0.790
IQ                              L        LM       UM       H
$w^{ABCD|E}_{2(j)}$             -0.840   -0.300   0.266    0.876
Sex                             M        F
$w^{ABCD|E}_{3(k)}$             0.082    -0.082
Parental Encouragement          L        H
$w^{ABCD|E}_{4(l)}$             -1.214   1.214

Two-Factor Interactions

IQ by Sex
                  Sex
IQ                M        F
L                 -0.134   0.134
LM                -0.078   0.078
UM                0.094    -0.094
H                 0.118    -0.118
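These parameters can be combined into a predicted odds of planning for college for any profile. The sketch below sums the constant, the four main effects and the IQ-by-Sex interaction, assuming (as in the earlier logit examples) that the logit is the log of the ratio "college plans: yes" to "college plans: no"; the function name `college_plan_odds` is illustrative.

```python
import math

constant = -1.292
ses_effect = {"L": -0.650, "LM": -0.200, "UM": 0.062, "H": 0.790}
iq_effect = {"L": -0.840, "LM": -0.300, "UM": 0.266, "H": 0.876}
sex_effect = {"M": 0.082, "F": -0.082}
pe_effect = {"L": -1.214, "H": 1.214}
# IQ by Sex interaction for males; the female value is the negative
iq_by_sex_male = {"L": -0.134, "LM": -0.078, "UM": 0.094, "H": 0.118}

def college_plan_odds(ses, iq, sex, pe):
    """Predicted odds (yes : no) of college plans for one profile."""
    interaction = iq_by_sex_male[iq] if sex == "M" else -iq_by_sex_male[iq]
    logit = (constant + ses_effect[ses] + iq_effect[iq]
             + sex_effect[sex] + pe_effect[pe] + interaction)
    return math.exp(logit)

high_profile = college_plan_odds("H", "H", "M", "H")
low_profile = college_plan_odds("L", "L", "F", "L")
print(f"male, high SES, high IQ, high encouragement: odds = {high_profile:.2f}")
print(f"female, low SES, low IQ, low encouragement:  odds = {low_profile:.4f}")
```

Parental encouragement has the largest single effect: moving from low to high multiplies the predicted odds by exp(2 × 1.214), roughly 11.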