Stat 504, Lecture 11
Key Concept:
• Logistic Regression for I × J tables
Summary of Chi-square test of independence
for I × J tables:
• Chi square tests are used to test whether two
categorical variables measured on a group of
subjects are independent.
• Null hypothesis, H0: X is independent of Y
(refer to Lecture 8(36) for equivalent statements)
• Construct a contingency table that counts
numbers of subjects for each combination of
levels of variable X and variable Y
• If X and Y are independent, then the probability
distribution of X is the same for each Y (and
vice versa).
• Calculate expected number of subjects in each
cell if X and Y are independent:
Expected value = (row count × column count) / (total count)
• The chi-square statistic is the sum of
(observed − expected)² / expected
over all cells in the table.
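These steps can be checked numerically. Below is a minimal Python sketch (illustrative, not part of the lecture's SAS code) that computes the expected counts and X² for the 3 × 2 parent/student smoking table used later in this lecture:

```python
# Chi-square test of independence for an I x J contingency table.
# Rows: both / one / neither parent smokes; columns: student smokes yes / no.
table = [[400, 1380],
         [416, 1823],
         [188, 1168]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count = (row total * column total) / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# X^2 = sum over cells of (observed - expected)^2 / expected
x2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(len(table)) for j in range(len(table[0])))

df = (len(table) - 1) * (len(table[0]) - 1)
print(round(x2, 1), df)  # X^2 = 37.6 with 2 df, matching the lecture
```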
• The null distribution of the chi-square statistic
(e.g. X2 or G2) is the chi square distribution
with (I − 1) ∗ (J − 1) df, where I is the number of
rows and J is the number of columns in the
contingency table.
• If the observed chi-square statistic is more
extreme than the chosen critical value of the
null distribution, then reject the hypothesis of
independence (e.g. for α = 0.05 and 1 df, the
critical value is χ2 = 3.84; if X2 > 3.84, or
equivalently p-value < 0.05, then reject H0).
• SAS: under PROC FREQ, option CHISQ or in
SAS Analyst (Statistics/Table Analysis with
Statistics:ChiSquare)
• For limitations of chi-square test and statistics
refer to Lecture 10(24)
• If the variables are dependent, further investigate
the direction and magnitude of associations (e.g.
difference in proportions, relative risk, odds
ratios, partitioning chi-square, logistic regression,
log-linear models, etc...)
From the last lecture:
Example 2. The table below classifies 5375 high
school students according to the smoking behavior of
the student (Z) and the smoking behavior of the
student’s parents (Y ).
Student smokes?
How many parents smoke? Yes (Z = 1) No (Z = 2)
Both (Y = 1) 400 1380
One (Y = 2) 416 1823
Neither (Y = 3) 188 1168
The test for independence yields X2 = 37.6 and
G2 = 38.4 with 2 df (p-values are essentially zero), so
Y and Z are related.
It is natural to think of Z as a response and Y as a
predictor, so we will discuss the conditional
distribution of Z given Y . Let
p1 = P (Z = 1 |Y = 1),
p2 = P (Z = 1 |Y = 2),
p3 = P (Z = 1 |Y = 3).
The estimates of these probabilities are
p̂1 = 400/1780 = .225,
p̂2 = 416/2239 = .186,
p̂3 = 188/1356 = .139.
The effect of Y on Z can be summarized with two
differences. For example, we can calculate the
increase in the probability of Z = 1 as Y goes from 3
to 2, and as Y goes from 2 to 1:
d̂23 = p̂2 − p̂3 = .047
d̂12 = p̂1 − p̂2 = .039
Alternatively, we may treat Y = 3 as a baseline and
calculate the increase in probability as we go from
Y = 3 to Y = 2 and from Y = 3 to Y = 1:
d̂23 = p̂2 − p̂3 = .047
d̂13 = p̂1 − p̂3 = .086
We may also express the effects as odds ratios:
θ̂23 = (416 × 1168)/(188 × 1823) = 1.42,
θ̂13 = (400 × 1168)/(188 × 1380) = 1.80.
Students with one smoking parent are estimated to be
42% more likely (on the odds scale) to smoke than
students whose parents do not smoke, and students
with two smoking parents are 80% more likely to
smoke than students whose parents do not smoke.
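The arithmetic behind these odds ratios is easy to verify; a short Python sketch (illustrative, not from the lecture):

```python
# Odds ratios comparing rows of the 3 x 2 smoking table,
# using "neither parent smokes" as the reference group.
theta23 = (416 * 1168) / (188 * 1823)  # one smoking parent vs. neither
theta13 = (400 * 1168) / (188 * 1380)  # both smoking parents vs. neither

print(round(theta23, 2), round(theta13, 2))  # 1.42 and 1.80
```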
Another way to describe the effects is to perform
"partitioned tests," that is, to form a sequence of
smaller tables by combining rows and/or columns in a
meaningful way.
In our example we might be interested in comparing
students' smoking behavior when neither parent
smokes versus when at least one parent smokes. Thus
we can combine the first two rows of our 3 × 2 table
and look at a new 2 × 2 table:
Student smokes Student doesn’t
1–2 parents smoke 816 3203
Neither parent smokes 188 1168
This table has X2 = 27.7, G2 = 29.1, p-value≈ 0, and
θ̂ = 1.58. We estimate that a student is 58% more
likely, on the odds scale, to smoke if he or she has at
least one smoking parent.
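The collapsed table's statistics can be reproduced directly; a Python sketch (illustrative) computing X², G², and θ̂ for the 2 × 2 table above:

```python
import math

# Collapsed 2 x 2 table: at least one smoking parent vs. neither.
obs = [[816, 3203],
       [188, 1168]]
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
n = sum(row)
exp = [[r * c / n for c in col] for r in row]

# Pearson chi-square and likelihood-ratio statistics
x2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
         for i in range(2) for j in range(2))
g2 = 2 * sum(obs[i][j] * math.log(obs[i][j] / exp[i][j])
             for i in range(2) for j in range(2))
theta = (816 * 1168) / (188 * 3203)  # odds ratio

print(round(x2, 1), round(g2, 1), round(theta, 2))  # 27.7, 29.1, 1.58
```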
But what if
• we want to model the probabilities of a response
variable as a function of some explanatory
variables, e.g. the "risk" of student smoking as a
function of the parents' behavior.
• we want to perform descriptive discriminant
analyses such as describing the differences
between individuals in separate groups as a
function of explanatory variables, e.g. student
smokers and nonsmokers as a function of parents
smoking behavior
• we want to predict probabilities that individuals
fall into two categories of the binary response as
a function of some explanatory variables, e.g.
what is the probability that a student is a smoker
given that neither of his/her parents smokes.
• we want to classify individuals into two categories
based on explanatory variables, e.g. classify new
students into the "smoking" or "nonsmoking" group
depending on the parents' smoking behavior.
• we want to develop a social network model, adjust
for "bias," analyze choice data, etc.
Logistic Regression
(ref. Chs. 5 - 7, Agresti) is another way we can
model the probabilities of a response variable
as a function of some explanatory variables. For
example, what is the probability that the child
smokes given that at least one parent smokes.
Logistic regression is a special type of generalized
linear model (GLM).
For now we’ll only focus on modeling the
probabilities of a binary response variable as a
function of another discrete variable, in order to
see how logistic regression ties in with the test of
independence and measures of associations in
two-way tables.
Now, suppose we arrange the data like this,
yi ni
1–2 parents smoke 816 4019
Neither parent smokes 188 1356
where yi is the number of children who smoke, ni is
the number of children, and πi is the probability of
smoking in group i, for i = 1, 2. Then we suppose that
yi ∼ Bin(ni, πi),
and let X be a dummy variable,
Xi = 0 if neither parent smokes,
Xi = 1 if at least one parent smokes.
Then the logistic regression model is
logit(πi) = log(πi/(1 − πi)) = β0 + β1Xi,
or
πi = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi)),
which says that the log-odds of smoking are β0 for
"smoking parents" = none, and β0 + β1 for "smoking
parents" = at least one.
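The two equivalent forms of the model are a logit/inverse-logit pair; a generic Python sketch (illustrative helper functions, not part of the lecture's SAS program):

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def expit(x):
    """Inverse logit: maps log-odds back to a probability."""
    return math.exp(x) / (1 + math.exp(x))

# The two forms of the model are equivalent:
# logit(pi) = b0 + b1*X  <=>  pi = expit(b0 + b1*X)
p = 0.3
print(round(expit(logit(p)), 10))  # round trip recovers 0.3
```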
We can fit the model in SAS like this:
(ref: lec10ex2.sas)
data smoke;
input s $ y n ;
cards;
smoke 816 4019
nosmoke 188 1356
;
proc logistic descending;
class s (ref=first) / param=ref;
model y/n = s /scale=none;
run;
In the data step, the dollar sign $ indicates that S is a
character-string variable.
In the logistic step, the statement
• descending
ensures that you are modeling the probability of an
"event," which takes the value 1; otherwise, by
default, SAS models the probability of the "nonevent"
• class s (ref=first) / param=ref;
says that S should be coded as a dummy variable
using the first category as the reference or zero
group. (The first category is "nosmoke," because
it comes before "smoke" in alphabetical order.)
• model y/n
Because we have grouped data (i.e. multiple trials
per line of the data set), the model statement
uses the “event/trial” syntax, in which y/n
appears on the left-hand side of the equal sign.
The predictors go on the right-hand side,
separated by spaces if there is more than one.
An intercept is added automatically by default.
Let’s look at some output from this program:
• Model information
• Response profile
• Class Level Information
• Model convergence
• Goodness-of-fit statistics
• Model fit
• Testing null hypothesis beta=0
• Analysis of maximum likelihood estimates
• Odds Ratio Estimates
Model Information
Data Set WORK.SMOKE
Response Variable (Events) y
Response Variable (Trials) n
Number of Observations 2
Model binary logit
Optimization Technique Fisher’s scoring
Response Profile
Ordered Binary Total
Value Outcome Frequency
1 Event 1004
2 Nonevent 4371
Class Level Information
Design
Variables
Class Value 1
s nosmoke 0
smoke 1
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Number of events/trials observations: 2
Model Fit Statistics
                 Intercept     Intercept and
Criterion        Only          Covariates
AIC              5178.510      5151.390
SC               5185.100      5164.569
-2 Log L         5176.510      5147.390
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 29.1207 1 <.0001
Score 27.6766 1 <.0001
Wald 27.3361 1 <.0001
Analysis of Maximum Likelihood Estimates
                       Standard    Wald
Parameter DF Estimate  Error       Chi-Square  Pr > ChiSq
Intercept  1  -1.8266  0.0786      540.2949    <.0001
s smoke    1   0.4592  0.0878       27.3361    <.0001
Our logistic regression model:
log(πi/(1 − πi)) = −1.8266 + 0.4592Xi
The estimated coefficient of the dummy variable,
β̂1 = 0.4592,
agrees exactly with the log-odds ratio from the 2 × 2
table: θ̂ = (816 × 1168)/(188 × 3203) = 1.58, and
ln(1.58) = 0.459. The standard error for β̂1, 0.0878,
agrees exactly with the standard error that you can
calculate from the 2 × 2 table.
This is not surprising, because in the logistic
regression model β1 is the difference in the log-odds of
children smoking as we move from "nosmoke" (i.e.
neither parent smokes, Xi = 0) to "smoke" (i.e. at
least one parent smokes, Xi = 1), and a difference
in log-odds is a log-odds ratio.
Also, in this model, β0 is the log-odds of children
smoking when neither parent smokes (Xi = 0). Looking
at the 2 × 2 table, the estimated log-odds for children
of nonsmoking parents is
log(188/1168) = log(0.161) = −1.8266,
which agrees with β̂0 from the logistic model.
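Both agreements can be checked by hand from the 2 × 2 cell counts; the SE uses the usual large-sample formula sqrt(1/a + 1/b + 1/c + 1/d) for a log-odds ratio (a Python sketch, not from the lecture's SAS code):

```python
import math

a, b = 816, 3203   # at least one smoking parent: student smokes yes / no
c, d = 188, 1168   # neither parent smokes: student smokes yes / no

b0 = math.log(c / d)                   # intercept: log-odds when X = 0
b1 = math.log((a * d) / (b * c))       # slope: log-odds ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # large-sample SE of log-odds ratio

print(round(b0, 4), round(b1, 4), round(se, 4))  # -1.8266, 0.4592, 0.0878
```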
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 29.1207 1 <.0001
Score 27.6766 1 <.0001
Wald 27.3361 1 <.0001
This tests that a set of coefficients is simultaneously
zero, i.e. H0: β1 = β2 = · · · = βk = 0, versus the
alternative that at least one of them is nonzero. In our
example, since we have only a single covariate in this
model, this is equivalent to testing β1 = 0, and this is
the same as testing that Y and X are independent.
Notice that the "likelihood ratio" statistic matches the
G2 we calculated in the last lecture; the "score"
statistic also has an approximate chi-square
distribution. We'll discuss these in detail a bit later.
The Wald test compares the statistic
z = β̂j / SE(β̂j)
to a standard normal distribution; the p-value is twice
the area to the right of |z| under the normal curve.
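A quick Python sketch of the Wald calculation, plugging in the rounded estimates from the output above (so the statistic differs slightly from SAS's 27.3361, which uses unrounded values); the normal tail area comes from the standard-library error function:

```python
import math

b1, se = 0.4592, 0.0878      # estimate and SE from the SAS output
z = b1 / se                  # Wald z statistic
wald_chisq = z ** 2          # SAS reports z^2 as a chi-square with 1 df

# Two-sided p-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
p = math.erfc(abs(z) / math.sqrt(2))

print(round(wald_chisq, 2), p < 0.0001)  # about 27.35, p < .0001
```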
The goodness-of-fit statistics X2 and G2 from this
model are both zero, because the model is saturated.
However, suppose that we fit the intercept-only
model. This is accomplished by removing the
predictor from the model statement, like this:
model y/n = / scale=none;
The goodness-of-fit statistics are shown below.
Deviance and Pearson Goodness-of-Fit Statistics
Criterion DF Value Value/DF Pr > ChiSq
Deviance 1 29.1207 29.1207 <.0001
Pearson 1 27.6766 27.6766 <.0001
The Pearson statistic X2 = 27.6766 is precisely equal
to the ordinary X2 for testing independence in the
2 × 2 table. And the deviance G2 = 29.1207 is
precisely equal to the G2 for testing independence in
the 2 × 2 table.
Thus we have shown that analyzing a 2 × 2 table for
relatedness is equivalent to logistic regression with a
dummy variable.
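Under the intercept-only model every group gets the pooled estimate π̂ = 1004/5375, and the two goodness-of-fit statistics above can be reproduced from the grouped counts (a Python sketch, not part of the lecture's SAS code):

```python
import math

groups = [(816, 4019),   # (smokers, total) for 1-2 smoking parents
          (188, 1356)]   # (smokers, total) for neither parent smoking

# Pooled probability under the intercept-only model
pihat = sum(y for y, _ in groups) / sum(n for _, n in groups)

# Pearson X^2: sum of squared standardized residuals
pearson = sum((y - n * pihat) ** 2 / (n * pihat * (1 - pihat))
              for y, n in groups)

# Deviance G^2 for grouped binomial data
deviance = 2 * sum(y * math.log(y / (n * pihat))
                   + (n - y) * math.log((n - y) / (n - n * pihat))
                   for y, n in groups)

print(round(pearson, 4), round(deviance, 4))  # ≈ 27.6766 and ≈ 29.1207
```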
Our logistic regression model:
log(πi/(1 − πi)) = −1.8266 + 0.4592Xi
Interpretation of coefficients:
For every one-unit increase in the explanatory
variable X1 (e.g. changing from no smoking parents
to smoking parents), the odds of ”success” πi/(1− πi)
will be multiplied by exp(β1), given that all the other
variables are held constant.
For our example, exp(0.4592) = 1.5828, which is the
odds ratio we already calculated.
Further, the predicted probability of a child being a
smoker if at least one parent smokes is
P (Yi = 1|Xi = 1) = exp(−1.8266 + 0.4592) / (1 + exp(−1.8266 + 0.4592)) = 0.20
See lec10ex2.sas for the commands on using the OUTPUT
statement.
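This fitted probability can be reproduced from the coefficients and, since the model is saturated, it matches the raw proportion 816/4019 from the grouped data (a Python sketch):

```python
import math

b0, b1 = -1.8266, 0.4592   # fitted coefficients from the SAS output

# P(Y = 1 | X = 1): inverse-logit of the linear predictor at X = 1
eta = b0 + b1 * 1
p = math.exp(eta) / (1 + math.exp(eta))

print(round(p, 2))  # 0.20, the same as 816/4019 from the grouped data
```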
Now let's replicate the analysis of the original 3 × 2
table with logistic regression.
First, we re-express the data in terms of yi = number
of smoking students and ni = number of students, for
three groups based on the parents' behavior:
Student smokes?
How many parents smoke? yi ni
Both 400 1780
One 416 2239
Neither 188 1356
Then we decide on a baseline level for the explanatory
variable X, and create k − 1 dummy indicators if X is
a categorical variable with k levels. For our example,
let parent smoking = Neither be the baseline, and
define a pair of dummy indicators,
X1 = 1 if parent smoking = One, 0 otherwise,
X2 = 1 if parent smoking = Both, 0 otherwise.
Let π = probability of student smoking. Then the model
log(π/(1 − π)) = β0 + β1X1 + β2X2
says that the log-odds of student smoking are β0 for
parents smoking = neither, β0 + β1 for parents
smoking = one, and β0 + β2 for parents smoking = both.
Therefore,
β1 = log-odds for one − log-odds for neither,
β2 = log-odds for both − log-odds for neither,
and we expect to get β̂1 = ln(1.42) = .351 and
β̂2 = ln(1.80) = .588. The estimated intercept should
be
β̂0 = log(188/1168) = −1.826
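These expected values can be computed directly from the 3 × 2 table and compared with the SAS output that follows (a Python sketch, not from the lecture's code):

```python
import math

# Expected coefficients for the 3-level model, from the 3 x 2 table
# (baseline: neither parent smokes).
theta_one  = (416 * 1168) / (188 * 1823)   # odds ratio, one vs. neither
theta_both = (400 * 1168) / (188 * 1380)   # odds ratio, both vs. neither

b0 = math.log(188 / 1168)   # log-odds when neither parent smokes
b1 = math.log(theta_one)    # expected coefficient for "one"
b2 = math.log(theta_both)   # expected coefficient for "both"

print(round(b0, 4), round(b1, 4), round(b2, 4))  # -1.8266, 0.3491, 0.5882
```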
Here are two versions of a SAS program for fitting
this model
(ref: lec11ex2v1.sas):
data smoke;
input s $ y n ;
cards;
2 400 1780
1 416 2239
0 188 1356
;
proc logistic descending;
class s (ref=first)/ param=ref;
model y/n = s /scale=none;
output out=predict pred=prob;
run;
proc print data=predict;
run;
(ref: lec11ex2.sas):
proc logistic descending;
class s (ref=’neither’) / order=data param=ref;
model y/n = s /scale=none;
output out=predict pred=prob;
run;
proc print data=predict;
run;
In the class statement, the option order=data tells
SAS to sort the categories of S by the order in which
they appear in the dataset rather than alphabetical
order. The option param=ref tells SAS to create a set
of two dummy variables to distinguish among the
three categories. The option ref=’neither’ makes
neither the reference group (i.e. the group for which
both dummy variables are zero).
Let’s look at some relevant portions of the output of
lec11ex2v1.lst:
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -1.8266 0.0786 540.2949 <.0001
s 1 1 0.3491 0.0955 13.3481 0.0003
s 2 1 0.5882 0.0970 36.8105 <.0001
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
s 1 vs 0 1.418 1.176 1.710
s 2 vs 0 1.801 1.489 2.178
The saturated model is
logit(π) = −1.8266 + 0.3491X1 + 0.5882X2
For example, the predicted probability of a student
smoking given that only one parent smokes is
P (Yi = 1|neither = 0, one = 1, both = 0)
= P (Yi = 1|X1 = 1, X2 = 0)
= exp(−1.8266 + 0.3491) / (1 + exp(−1.8266 + 0.3491)) = 0.186
In this case, the "intercept only" model says that
student smoking is unrelated to the parents' smoking
behavior, so the test of the global null hypothesis
β1 = β2 = 0 is equivalent to the usual test for
independence in the 3 × 2 table. The estimated
coefficients and SEs are as we predicted, as well as the
estimated odds ratios.
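Since the model is saturated, this prediction also matches the raw proportion 416/2239 from the 3 × 2 table; a Python sketch:

```python
import math

b0, b1, b2 = -1.8266, 0.3491, 0.5882   # fitted coefficients from the output

# One smoking parent: X1 = 1, X2 = 0
eta = b0 + b1
p = math.exp(eta) / (1 + math.exp(eta))

print(round(p, 3))  # 0.186, matching 416/2239 from the 3 x 2 table
```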
If we include the statement
output out=predict pred=phat
reschi=pearson resdev=deviance;
in the PROC LOGISTIC call, then SAS creates a
new dataset called "predict" that includes all of
the variables in the original dataset, the predicted
probabilities π̂i, the Pearson residuals, and the
deviance residuals. Then we can add some code
to calculate and print out the estimated expected
number of successes µ̂i = niπ̂i and failures
ni − µ̂i = ni(1 − π̂i).
A revised SAS program that does all this is shown
below:
(ref: lec11ex2v2.sas)
data smoke;
input s $ y n ;
cards;
2 400 1780
1 416 2239
0 188 1356
;
proc logistic descending;
class s (ref=first)/ param=ref;
model y/n = s /scale=none;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;
data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;
proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
run;
Running this program gives a new output section:
Obs s y n prob shat fhat pearson deviance
1 2 400 1780 0.22472 400.000 1380.00 -.000000031 0
2 1 416 2239 0.18580 416.000 1823.00 -3.6617E-15 0
3 0 188 1356 0.13864 188.000 1168.00 -.000001291 -.000001307
All of the "shat" and "fhat" values are greater than
5.0, so the χ2 approximation is trustworthy.
Testing H0 : βj = 0 versus H1 : βj ≠ 0.
The Wald chi-square statistic z2 = (β̂j/SE(β̂j))2 for
these tests is displayed along with the estimated
coefficients in the "Analysis of Maximum Likelihood
Estimates" section. A value of z2 bigger than 3.84
indicates that we can reject the null hypothesis βj = 0
at the .05 level.
Testing the joint significance of all predictors.
In SAS output: Testing Global Null Hypothesis:
BETA=0 / Model Fit Statistics / Overall
Goodness-of-Fit Statistics.
In our model
log(π/(1 − π)) = β0 + β1X1 + β2X2,
this is the test of H0 : β1 = β2 = 0 versus the
alternative that at least one of the coefficients
β1, . . . , βp is not zero.
In other words, this is testing the null hypothesis that
an intercept-only model is correct,
log(π/(1 − π)) = β0,
versus the alternative that the current model is
correct,
log(π/(1 − π)) = β0 + β1X1 + β2X2.
In the SAS output, three different chi-square statistics
for this test are displayed in the section "Testing
Global Null Hypothesis: BETA=0," corresponding to
the likelihood ratio, score, and Wald tests. This test
has k degrees of freedom, where k is the number of
dummy indicators, that is, the number of
β-parameters excluding the intercept.
Large chi-square statistics lead to small p-values and
provide evidence against the intercept-only model in
favor of the current model.
If these three tests agree, that's evidence that the
large-sample approximations are working well and the
results are trustworthy. If the results from the three
tests disagree, most statisticians would tend to trust
the likelihood-ratio test more than the other two.
Testing that an arbitrary group of coefficients
is zero. To test the null hypothesis that a group of k
coefficients is zero, we need to fit two models:
• the reduced model which omits the k predictors
in question, and
• the current model which includes them.
The null hypothesis is that the reduced model is true;
the alternative is that the current model is true.
To perform the test, we must look at the "Model Fit
Statistics" section and examine the value of "−2 Log
L" for "Intercept and Covariates." The
likelihood-ratio statistic is
∆G2 = (−2 log L from reduced model) − (−2 log L from current model)
and the degrees of freedom is k (the number of
coefficients in question). The p-value is
P(χ2(k) ≥ ∆G2). Larger values of ∆G2 lead to
smaller p-values, which provide evidence against the
reduced model in favor of the current model.
For our example, ∆G2 = 5176.510 − 5138.144 = 38.366
with df = 3 − 1 = 2. Notice that this matches
Likelihood Ratio 38.3658 2 <.0001
from the "Testing Global Null Hypothesis: BETA=0"
section.
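For 2 df the chi-square tail probability has the closed form P(χ2 ≥ x) = exp(−x/2), so the whole likelihood-ratio test fits in a few lines (a Python sketch, not part of the lecture's SAS code):

```python
import math

neg2logL_reduced = 5176.510   # -2 Log L, intercept only
neg2logL_current = 5138.144   # -2 Log L, intercept and covariates

dG2 = neg2logL_reduced - neg2logL_current   # likelihood-ratio statistic
df = 3 - 1                                  # two dummy coefficients tested

# For 2 df, the chi-square survival function is exp(-x/2)
p = math.exp(-dG2 / 2)

print(round(dG2, 3), p < 0.0001)  # 38.366, p < .0001
```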
Another way to calculate the test statistic is
∆G2 = (G2 from reduced model) − (G2 from current model),
where the G2's are the overall goodness-of-fit
statistics, which we will discuss in the next lecture.