Stat 504, Lecture 11
Key Concept:
• Logistic Regression for I × J tables
Summary of Chi-square test of independence
for I × J tables:
• Chi square tests are used to test whether two
categorical variables measured on a group of
subjects are independent.
• Null hypothesis, H0: X is independent of Y
(refer to Lecture 8(36) for equivalent statements)
• Construct a contingency table that counts
numbers of subjects for each combination of
levels of variable X and variable Y
• If X and Y are independent, then the probability
distribution of X is the same for each Y (and
vice versa).
• Calculate expected number of subjects in each
cell if X and Y are independent:
Expected value = (row count × column count) / (total count)
• The chi-square statistic is the sum of
(observed − expected)² / expected
over all cells in the table.
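These steps can be checked numerically. Below is a minimal Python sketch (illustrative, not part of the lecture's SAS code) that computes the expected counts and X² for the 3 × 2 parent/student smoking table used later in this lecture:

```python
# Chi-square test of independence for an I x J contingency table.
# Rows: both / one / neither parent smokes; columns: student smokes yes / no.
table = [[400, 1380],
         [416, 1823],
         [188, 1168]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count = (row total * column total) / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# X^2 = sum over cells of (observed - expected)^2 / expected
x2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(len(table)) for j in range(len(table[0])))

df = (len(table) - 1) * (len(table[0]) - 1)
print(round(x2, 1), df)  # X^2 = 37.6 with 2 df, matching the lecture
```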
• The null distribution of the chi-square statistic
(e.g. X2 or G2) is the chi square distribution
with (I − 1) ∗ (J − 1) df, where I is the number of
rows and J is the number of columns in the
contingency table.
• If the observed chi-square statistic is more
extreme than the chosen critical value of the
null distribution, then reject the hypothesis of
independence (e.g. for α = 0.05 and 1 df, the
critical value is χ2 = 3.84; if X2 > 3.84, or
equivalently p-value < 0.05, then reject H0).
• SAS: under PROC FREQ, option CHISQ or in
SAS Analyst (Statistics/Table Analysis with
Statistics:ChiSquare)
• For limitations of chi-square test and statistics
refer to Lecture 10(24)
• If the variables are dependent, further investigate
the direction and magnitude of associations (e.g.
difference in proportions, relative risk, odds
ratios, partitioning chi-square, logistic regression,
log-linear models, etc...)
From the last lecture:
Example 2. The table below classifies 5375 high
school students according to the smoking behavior of
the student (Z) and the smoking behavior of the
student’s parents (Y ).
Student smokes?
How many parents smoke? Yes (Z = 1) No (Z = 2)
Both (Y = 1) 400 1380
One (Y = 2) 416 1823
Neither (Y = 3) 188 1168
The test for independence yields X2 = 37.6 and
G2 = 38.4 with 2 df (p-values are essentially zero), so
Y and Z are related.
It is natural to think of Z as a response and Y as a
predictor, so we will discuss the conditional
distribution of Z given Y . Let
p1 = P (Z = 1 |Y = 1),
p2 = P (Z = 1 |Y = 2),
p3 = P (Z = 1 |Y = 3).
The estimates of these probabilities are
p̂1 = 400/1780 = .225,
p̂2 = 416/2239 = .186,
p̂3 = 188/1356 = .139.
The effect of Y on Z can be summarized with two
differences. For example, we can calculate the
increase in the probability of Z = 1 as Y goes from 3
to 2, and as Y goes from 2 to 1:
d̂23 = p̂2 − p̂3 = .047
d̂12 = p̂1 − p̂2 = .039
Alternatively, we may treat Y = 3 as a baseline and
calculate the increase in probability as we go from
Y = 3 to Y = 2 and from Y = 3 to Y = 1:
d̂23 = p̂2 − p̂3 = .047
d̂13 = p̂1 − p̂3 = .086
We may also express the effects as odds ratios:
θ̂23 = (416 × 1168)/(188 × 1823) = 1.42,
θ̂13 = (400 × 1168)/(188 × 1380) = 1.80.
Students with one smoking parent are estimated to be
42% more likely (on the odds scale) to smoke than
students whose parents do not smoke, and students
with two smoking parents are 80% more likely to
smoke than students whose parents do not smoke.
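The arithmetic behind these odds ratios is easy to verify; a short Python sketch (illustrative, not from the lecture):

```python
# Odds ratios comparing rows of the 3 x 2 smoking table,
# using "neither parent smokes" as the reference group.
theta23 = (416 * 1168) / (188 * 1823)  # one smoking parent vs. neither
theta13 = (400 * 1168) / (188 * 1380)  # both smoking parents vs. neither

print(round(theta23, 2), round(theta13, 2))  # 1.42 and 1.80
```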
Another way to describe the effects is to perform
"partitioned tests," that is, to form a sequence of
smaller tables by combining rows and/or columns in a
meaningful way.
In our example we might be interested in comparing
students' smoking behavior when neither parent
smokes versus when at least one parent smokes. Thus
we can combine the first two rows of our 3 × 2 table
and look at a new 2 × 2 table:
Student smokes Student doesn’t
1–2 parents smoke 816 3203
Neither parent smokes 188 1168
This table has X2 = 27.7, G2 = 29.1, p-value≈ 0, and
θ̂ = 1.58. We estimate that a student is 58% more
likely, on the odds scale, to smoke if he or she has at
least one smoking parent.
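The collapsed table's statistics can be reproduced directly; a Python sketch (illustrative) computing X², G², and θ̂ for the 2 × 2 table above:

```python
import math

# Collapsed 2 x 2 table: at least one smoking parent vs. neither.
obs = [[816, 3203],
       [188, 1168]]
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
n = sum(row)
exp = [[r * c / n for c in col] for r in row]

# Pearson chi-square and likelihood-ratio statistics
x2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
         for i in range(2) for j in range(2))
g2 = 2 * sum(obs[i][j] * math.log(obs[i][j] / exp[i][j])
             for i in range(2) for j in range(2))
theta = (816 * 1168) / (188 * 3203)  # odds ratio

print(round(x2, 1), round(g2, 1), round(theta, 2))  # 27.7, 29.1, 1.58
```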
But what if
• we want to model the probabilities of a response
variable as a function of some explanatory
variables, e.g. the "risk" of student smoking as a
function of the parents' behavior.
• we want to perform descriptive discriminant
analyses such as describing the differences
between individuals in separate groups as a
function of explanatory variables, e.g. student
smokers and nonsmokers as a function of parents
smoking behavior
• we want to predict probabilities that individuals
fall into two categories of the binary response as
a function of some explanatory variables, e.g.
what is the probability that a student is a smoker
given that neither of his/her parents smokes.
• we want to classify individuals into two categories
based on explanatory variables, e.g. classify new
students into the "smoking" or "nonsmoking" group
depending on the parents' smoking behavior.
• we want to develop a social network model, adjust
for "bias," analyze choice data, etc.
Logistic Regression
(ref. Chs. 5 - 7, Agresti) is another way we can
model the probabilities of a response variable
as a function of some explanatory variables. For
example, what is the probability that the child
smokes given that at least one parent smokes.
Logistic regression is a special type of generalized
linear model (GLM).
For now we’ll only focus on modeling the
probabilities of a binary response variable as a
function of another discrete variable, in order to
see how logistic regression ties in with the test of
independence and measures of associations in
two-way tables.
Now, suppose we arrange the data like this,
yi ni
1–2 parents smoke 816 4019
Neither parent smokes 188 1356
where yi is the number of children who smoke, ni is
the number of children, and πi is the probability of
smoking in group i, for i = 1, 2. Then we suppose that
yi ∼ Bin(ni, πi),
and let X be a dummy variable,
Xi = 0 if neither parent smokes,
Xi = 1 if at least one parent smokes.
Then the logistic regression model is
logit(πi) = log(πi/(1 − πi)) = β0 + β1Xi,
or
πi = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi)),
which says that the log-odds of smoking are β0 for
"smoking parents" = none, and β0 + β1 for "smoking
parents" = at least one.
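The two equivalent forms of the model are a logit/inverse-logit pair; a generic Python sketch (illustrative helper functions, not part of the lecture's SAS program):

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def expit(x):
    """Inverse logit: maps log-odds back to a probability."""
    return math.exp(x) / (1 + math.exp(x))

# The two forms of the model are equivalent:
# logit(pi) = b0 + b1*X  <=>  pi = expit(b0 + b1*X)
p = 0.3
print(round(expit(logit(p)), 10))  # round trip recovers 0.3
```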
We can fit the model in SAS like this:
(ref: lec10ex2.sas)
data smoke;
input s $ y n ;
cards;
smoke 816 4019
nosmoke 188 1356
;
proc logistic descending;
class s (ref=first) / param=ref;
model y/n = s /scale=none;
run;
In the data step, the dollar sign $ indicates that S is a
character-string variable.
In the logistic step, the statement
• descending
ensures that you are modeling the probability of an
"event," which takes the value 1; otherwise, by
default, SAS models the probability of the "nonevent"
• class s (ref=first) / param=ref;
says that S should be coded as a dummy variable
using the first category as the reference or zero
group. (The first category is "nosmoke," because
it comes before "smoke" in alphabetical order.)
• model y/n
Because we have grouped data (i.e. multiple trials
per line of the data set), the model statement
uses the “event/trial” syntax, in which y/n
appears on the left-hand side of the equal sign.
The predictors go on the right-hand side,
separated by spaces if there is more than one.
An intercept is added automatically by default.
Let’s look at some output from this program:
• Model information
• Response profile
• Class Level Information
• Model convergence
• Goodness-of-fit statistics
• Model fit
• Testing null hypothesis beta=0
• Analysis of maximum likelihood estimates
• Odds Ratio Estimates
Model Information
Data Set WORK.SMOKE
Response Variable (Events) y
Response Variable (Trials) n
Number of Observations 2
Model binary logit
Optimization Technique Fisher’s scoring
Response Profile
Ordered Binary Total
Value Outcome Frequency
1 Event 1004
2 Nonevent 4371
Class Level Information
Design
Variables
Class Value 1
s nosmoke 0
smoke 1
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Number of events/trials observations: 2
Model Fit Statistics
                 Intercept     Intercept and
Criterion        Only          Covariates
AIC              5178.510      5151.390
SC               5185.100      5164.569
-2 Log L         5176.510      5147.390
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 29.1207 1 <.0001
Score 27.6766 1 <.0001
Wald 27.3361 1 <.0001
Analysis of Maximum Likelihood Estimates
                       Standard    Wald
Parameter DF Estimate  Error       Chi-Square  Pr > ChiSq
Intercept  1  -1.8266  0.0786      540.2949    <.0001
s smoke    1   0.4592  0.0878       27.3361    <.0001
Our logistic regression model:
log(πi/(1 − πi)) = −1.8266 + 0.4592Xi
The estimated coefficient of the dummy variable,
β̂1 = 0.4592,
agrees exactly with the log-odds ratio from the 2 × 2
table: θ̂ = (816 × 1168)/(188 × 3203) = 1.58, and
ln(1.58) = 0.459. The standard error for β̂1, 0.0878,
agrees exactly with the standard error that you can
calculate from the 2 × 2 table.
This is not surprising, because in the logistic
regression model β1 is the difference in the log-odds of
children smoking as we move from "nosmoke" (i.e.
neither parent smokes, Xi = 0) to "smoke" (i.e. at
least one parent smokes, Xi = 1), and a difference
in log-odds is a log-odds ratio.
Also, in this model, β0 is the log-odds of children
smoking when neither parent smokes (Xi = 0). Looking
at the 2 × 2 table, the estimated log-odds for children
of nonsmoking parents is
log(188/1168) = log(0.161) = −1.8266,
which agrees with β̂0 from the logistic model.
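Both agreements can be checked by hand from the 2 × 2 cell counts; the SE uses the usual large-sample formula sqrt(1/a + 1/b + 1/c + 1/d) for a log-odds ratio (a Python sketch, not from the lecture's SAS code):

```python
import math

a, b = 816, 3203   # at least one smoking parent: student smokes yes / no
c, d = 188, 1168   # neither parent smokes: student smokes yes / no

b0 = math.log(c / d)                   # intercept: log-odds when X = 0
b1 = math.log((a * d) / (b * c))       # slope: log-odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # large-sample SE of log-odds ratio

print(round(b0, 4), round(b1, 4), round(se, 4))  # -1.8266, 0.4592, 0.0878
```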
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 29.1207 1 <.0001
Score 27.6766 1 <.0001
Wald 27.3361 1 <.0001
This tests that a set of coefficients is simultaneously
zero, i.e. H0: β1 = β2 = · · · = βk = 0, versus the
alternative that at least one of them is nonzero. In our
example, since we have only a single covariate in this
model, this is equivalent to testing β1 = 0, and this is
the same as testing that Y and X are independent.
Notice that the "likelihood ratio" statistic matches the
G2 we calculated in the last lecture; the "score"
statistic also has an approximate chi-square
distribution. We'll discuss these in detail a bit later.
The Wald test compares the statistic
z = β̂j / SE(β̂j)
to a standard normal distribution; the p-value is twice
the area to the right of |z| under the normal curve.
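A quick Python sketch of the Wald calculation, plugging in the rounded estimates from the output above (so the statistic differs slightly from SAS's 27.3361, which uses unrounded values); the normal tail area comes from the standard-library error function:

```python
import math

b1, se = 0.4592, 0.0878      # estimate and SE from the SAS output
z = b1 / se                  # Wald z statistic
wald_chisq = z ** 2          # SAS reports z^2 as a chi-square with 1 df

# Two-sided p-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
p = math.erfc(abs(z) / math.sqrt(2))

print(round(wald_chisq, 2), p < 0.0001)  # about 27.35, p < .0001
```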
The goodness-of-fit statistics X2 and G2 from this
model are both zero, because the model is saturated.
However, suppose that we fit the intercept-only
model. This is accomplished by removing the
predictor from the model statement, like this:
model y/n = / scale=none;
The goodness-of-fit statistics are shown below.
Deviance and Pearson Goodness-of-Fit Statistics
Criterion DF Value Value/DF Pr > ChiSq
Deviance 1 29.1207 29.1207 <.0001
Pearson 1 27.6766 27.6766 <.0001
The Pearson statistic X2 = 27.6766 is precisely equal
to the ordinary X2 for testing independence in the
2 × 2 table. And the deviance G2 = 29.1207 is
precisely equal to the G2 for testing independence in
the 2 × 2 table.
Thus we have shown that analyzing a 2 × 2 table for
relatedness is equivalent to logistic regression with a
dummy variable.
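Under the intercept-only model every group gets the pooled estimate π̂ = 1004/5375, and the two goodness-of-fit statistics above can be reproduced from the grouped counts (a Python sketch, not part of the lecture's SAS code):

```python
import math

groups = [(816, 4019),   # (smokers, total) for 1-2 smoking parents
          (188, 1356)]   # (smokers, total) for neither parent smoking

# Pooled probability under the intercept-only model
pihat = sum(y for y, _ in groups) / sum(n for _, n in groups)

# Pearson X^2: sum of squared standardized residuals
pearson = sum((y - n * pihat) ** 2 / (n * pihat * (1 - pihat))
              for y, n in groups)

# Deviance G^2 for grouped binomial data
deviance = 2 * sum(y * math.log(y / (n * pihat))
                   + (n - y) * math.log((n - y) / (n - n * pihat))
                   for y, n in groups)

print(round(pearson, 4), round(deviance, 4))  # ≈ 27.6766 and ≈ 29.1207
```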
Our logistic regression model:
log(πi/(1 − πi)) = −1.8266 + 0.4592Xi
Interpretation of coefficients:
For every one-unit increase in the explanatory
variable X1 (e.g. changing from no smoking parents
to smoking parents), the odds of ”success” πi/(1− πi)
will be multiplied by exp(β1), given that all the other
variables are held constant.
For our example, exp(0.4592) = 1.5828, which is the
odds ratio we already calculated.
Further, the predicted probability of a child being a
smoker if at least one parent smokes is
P (Yi = 1|Xi = 1) = exp(−1.8266 + 0.4592) / (1 + exp(−1.8266 + 0.4592)) = 0.20
See lec10ex2.sas for the commands on using the OUTPUT
statement.
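This fitted probability can be reproduced from the coefficients and, since the model is saturated, it matches the raw proportion 816/4019 from the grouped data (a Python sketch):

```python
import math

b0, b1 = -1.8266, 0.4592   # fitted coefficients from the SAS output

# P(Y = 1 | X = 1): inverse-logit of the linear predictor at X = 1
eta = b0 + b1 * 1
p = math.exp(eta) / (1 + math.exp(eta))

print(round(p, 2))  # 0.20, the same as 816/4019 from the grouped data
```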
Now let's replicate the analysis of the original 3 × 2
table with logistic regression.
First, we re-express the data in terms of yi = number
of smoking students and ni = number of students, for
three groups based on the parents' behavior:
Student smokes?
How many parents smoke? yi ni
Both 400 1780
One 416 2239
Neither 188 1356
Then we decide on a baseline level for the explanatory
variable X, and create k − 1 dummy indicators if X is
a categorical variable with k levels. For our example,
let parent smoking = Neither be the baseline, and
define a pair of dummy indicators,
X1 = 1 if parent smoking = One, 0 otherwise,
X2 = 1 if parent smoking = Both, 0 otherwise.
Let π = probability of student smoking. Then the model
log(π/(1 − π)) = β0 + β1X1 + β2X2
says that the log-odds of student smoking are β0 for
parents smoking = neither, β0 + β1 for parents
smoking = one, and β0 + β2 for parents smoking = both.
Therefore,
β1 = log-odds for one − log-odds for neither,
β2 = log-odds for both − log-odds for neither,
and we expect to get β̂1 = ln(1.42) = .351 and
β̂2 = ln(1.80) = .588. The estimated intercept should
be
β̂0 = log(188/1168) = −1.826
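These expected values can be computed directly from the 3 × 2 table and compared with the SAS output that follows (a Python sketch, not from the lecture's code):

```python
import math

# Expected coefficients for the 3-level model, from the 3 x 2 table
# (baseline: neither parent smokes).
theta_one  = (416 * 1168) / (188 * 1823)   # odds ratio, one vs. neither
theta_both = (400 * 1168) / (188 * 1380)   # odds ratio, both vs. neither

b0 = math.log(188 / 1168)   # log-odds when neither parent smokes
b1 = math.log(theta_one)    # expected coefficient for "one"
b2 = math.log(theta_both)   # expected coefficient for "both"

print(round(b0, 4), round(b1, 4), round(b2, 4))  # -1.8266, 0.3491, 0.5882
```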
Here are two versions of a SAS program for fitting
this model
(ref: lec11ex2v1.sas):
data smoke;
input s $ y n ;
cards;
2 400 1780
1 416 2239
0 188 1356
;
proc logistic descending;
class s (ref=first)/ param=ref;
model y/n = s /scale=none;
output out=predict pred=prob;
run;
proc print data=predict;
run;
(ref: lec11ex2.sas):
proc logistic descending;
class s (ref=’neither’) / order=data param=ref;
model y/n = s /scale=none;
output out=predict pred=prob;
run;
proc print data=predict;
run;
In the class statement, the option order=data tells
SAS to sort the categories of S by the order in which
they appear in the dataset rather than alphabetical
order. The option param=ref tells SAS to create a set
of two dummy variables to distinguish among the
three categories. The option ref=’neither’ makes
neither the reference group (i.e. the group for which
both dummy variables are zero).
Let’s look at some relevant portions of the output of
lec11ex2v1.lst:
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -1.8266 0.0786 540.2949 <.0001
s 1 1 0.3491 0.0955 13.3481 0.0003
s 2 1 0.5882 0.0970 36.8105 <.0001
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
s 1 vs 0 1.418 1.176 1.710
s 2 vs 0 1.801 1.489 2.178
The saturated model is
logit(π) = −1.8266 + 0.3491X1 + 0.5882X2
For example, the predicted probability of a student
smoking given that only one parent smokes is
P (Yi = 1|neither = 0, one = 1, both = 0)
= P (Yi = 1|X1 = 1, X2 = 0)
= exp(−1.8266 + 0.3491) / (1 + exp(−1.8266 + 0.3491)) = 0.186
In this case, the "intercept only" model says that
student smoking is unrelated to the parents' smoking
behavior, so the test of the global null hypothesis
β1 = β2 = 0 is equivalent to the usual test for
independence in the 3 × 2 table. The estimated
coefficients and SEs are as we predicted, as well as the
estimated odds ratios.
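Since the model is saturated, this prediction also matches the raw proportion 416/2239 from the 3 × 2 table; a Python sketch:

```python
import math

b0, b1, b2 = -1.8266, 0.3491, 0.5882   # fitted coefficients from the output

# One smoking parent: X1 = 1, X2 = 0
eta = b0 + b1
p = math.exp(eta) / (1 + math.exp(eta))

print(round(p, 3))  # 0.186, matching 416/2239 from the 3 x 2 table
```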
If we include the statement
output out=predict pred=phat
reschi=pearson resdev=deviance;
in the PROC LOGISTIC call, then SAS creates a
new dataset called "predict" that includes all of
the variables in the original dataset, the predicted
probabilities π̂i, the Pearson residuals, and the
deviance residuals. Then we can add some code
to calculate and print out the estimated expected
number of successes µ̂i = niπ̂i and failures
ni − µ̂i = ni(1 − π̂i).
A revised SAS program that does all this is shown
below:
(ref: lec11ex2v2.sas)
data smoke;
input s $ y n ;
cards;
2 400 1780
1 416 2239
0 188 1356
;
proc logistic descending;
class s (ref=first)/ param=ref;
model y/n = s /scale=none;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;
data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;
proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
run;
Running this program gives a new output section:
Obs s y n prob shat fhat pearson deviance
1 2 400 1780 0.22472 400.000 1380.00 -.000000031 0
2 1 416 2239 0.18580 416.000 1823.00 -3.6617E-15 0
3 0 188 1356 0.13864 188.000 1168.00 -.000001291 -.000001307
All of the "shat" and "fhat" values are greater than
5.0, so the χ2 approximation is trustworthy.
Testing H0 : βj = 0 versus H1 : βj ≠ 0.
The Wald chi-square statistic z2 = (β̂j/SE(β̂j))2 for
these tests is displayed along with the estimated
coefficients in the "Analysis of Maximum Likelihood
Estimates" section. A value of z2 bigger than 3.84
indicates that we can reject the null hypothesis βj = 0
at the .05 level.
Testing the joint significance of all predictors.
In SAS output: Testing Global Null Hypothesis:
BETA=0 / Model Fit Statistics / Overall
Goodness-of-Fit Statistics.
In our model
log(π/(1 − π)) = β0 + β1X1 + β2X2,
this is the test of H0 : β1 = β2 = 0 versus the
alternative that at least one of the coefficients
β1, . . . , βp is not zero.
In other words, this is testing the null hypothesis that
an intercept-only model is correct,
log(π/(1 − π)) = β0,
versus the alternative that the current model is
correct,
log(π/(1 − π)) = β0 + β1X1 + β2X2.
In the SAS output, three different chi-square statistics
for this test are displayed in the section "Testing
Global Null Hypothesis: BETA=0," corresponding to
the likelihood ratio, score, and Wald tests. This test
has k degrees of freedom, where k is the number of
dummy indicators, that is, the number of
β-parameters excluding the intercept.
Large chi-square statistics lead to small p-values and
provide evidence against the intercept-only model in
favor of the current model.
If these three tests agree, that's evidence that the
large-sample approximations are working well and the
results are trustworthy. If the results from the three
tests disagree, most statisticians would tend to trust
the likelihood-ratio test more than the other two.
Testing that an arbitrary group of coefficients
is zero. To test the null hypothesis that a group of k
coefficients is zero, we need to fit two models:
• the reduced model which omits the k predictors
in question, and
• the current model which includes them.
The null hypothesis is that the reduced model is true;
the alternative is that the current model is true.
To perform the test, we must look at the "Model Fit
Statistics" section and examine the value of "−2 Log
L" for "Intercept and Covariates." The
likelihood-ratio statistic is
∆G2 = (−2 log L from reduced model) − (−2 log L from current model)
and the degrees of freedom is k (the number of
coefficients in question). The p-value is
P(χ2(k) ≥ ∆G2). Larger values of ∆G2 lead to
smaller p-values, which provide evidence against the
reduced model in favor of the current model.
For our example, ∆G2 = 5176.510 − 5138.144 = 38.366
with df = 3 − 1 = 2. Notice that this matches
Likelihood Ratio 38.3658 2 <.0001
from the "Testing Global Null Hypothesis: BETA=0"
section.
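For 2 df the chi-square tail probability has the closed form P(χ2 ≥ x) = exp(−x/2), so the whole likelihood-ratio test fits in a few lines (a Python sketch, not part of the lecture's SAS code):

```python
import math

neg2logL_reduced = 5176.510   # -2 Log L, intercept only
neg2logL_current = 5138.144   # -2 Log L, intercept and covariates

dG2 = neg2logL_reduced - neg2logL_current   # likelihood-ratio statistic
df = 3 - 1                                  # two dummy coefficients tested

# For 2 df, the chi-square survival function is exp(-x/2)
p = math.exp(-dG2 / 2)

print(round(dG2, 3), p < 0.0001)  # 38.366, p < .0001
```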
Another way to calculate the test statistic is
∆G2 = (G2 from reduced model) − (G2 from current model),
where the G2's are the overall goodness-of-fit
statistics, which we will discuss in the next lecture.