April 6
• Logistic Regression
– Estimating probability based on logistic model
– Testing differences among multiple groups
– Assumptions for model
Logistic regression equation
Model log odds of outcome as a linear function of one or more variables
Xi = predictors, independent variables
βi is the increase in log odds for a 1-unit increase in Xi
exp(βi) is the relative odds for a 1-unit increase in Xi
The model is:
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$$
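Logistic Regression Prediction
Estimating Probability of Y=1
Goal: Estimate π for a set of X values
The model is:
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$$
Solve for π:
$$\pi = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)} = \frac{\text{ODDS}}{1 + \text{ODDS}}$$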
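The algebra behind solving for π can be written out step by step. This is a sketch in the slide's notation; LO is used here as shorthand for the linear predictor β0 + β1x1 + β2x2 + ….

```latex
\begin{aligned}
\log\left(\frac{\pi}{1-\pi}\right) &= LO \\
\frac{\pi}{1-\pi} &= e^{LO} = \text{ODDS} \\
\pi &= \text{ODDS}\,(1-\pi) \\
\pi\,(1+\text{ODDS}) &= \text{ODDS} \\
\pi &= \frac{\text{ODDS}}{1+\text{ODDS}}
\end{aligned}
```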
Steps in Estimating π
• Pick values for x1, x2, …, xp
• Compute log odds for your values of Xs using results
– LO = b0 + b1x1 + b2x2 + … + bpxp
• EXP LO to get odds
– Odds = EXP (LO)
• Compute estimate of π
– π = ODDS/(ODDS + 1)
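The steps above can be sketched in a few lines of Python. This is an illustration only, not part of the SAS workflow in these slides; the function name `estimate_probability` and the coefficients in the usage line are made up for the example.

```python
import math

def estimate_probability(coefs, xs):
    """Estimate P(Y=1) from fitted logistic coefficients.

    coefs = [b0, b1, ..., bp] (intercept first); xs = [x1, ..., xp].
    """
    b0, slopes = coefs[0], coefs[1:]
    if len(slopes) != len(xs):
        raise ValueError("need one x value per slope coefficient")
    # Step 2: log odds LO = b0 + b1*x1 + ... + bp*xp
    log_odds = b0 + sum(b * x for b, x in zip(slopes, xs))
    # Step 3: exponentiate the log odds to get the odds
    odds = math.exp(log_odds)
    # Step 4: convert odds to a probability
    return odds / (odds + 1.0)

# Hypothetical coefficients chosen so that LO = -1 + 0.5*2 = 0,
# which gives odds of 1 and a probability of one half.
p = estimate_probability([-1.0, 0.5], [2.0])
print(p)
```

With log odds of 0 the odds are 1, so the estimate is 0.5, which is a quick sanity check on the conversion.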
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -6.0621    1.2884           22.1395           <.0001
AGE         1     0.0605    0.0223            7.3310           0.0068
women       1    -0.3967    0.3166            1.5701           0.2102
log(odds) = -6.0621 + 0.0605*age - 0.3967*women
What is the estimated probability of CVD for a 60-year-old man?
log(odds) = -6.0621 + 0.0605(60) - 0.3967(0) = -2.4321
Odds = exp(-2.4321) = 0.0878
Prob = 0.0878 / (1 + 0.0878) = 0.0808
How old does a woman have to be to have the same risk?
One year of age increases log(odds) by 0.0605
Being female decreases log(odds) by 0.3967
Compute 0.3967/0.0605 = 6.6, so a woman would have to be 66.6 years old to have P = 0.0808
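The worked example can be checked numerically. The sketch below uses the three estimates from the output table; the function and variable names are made up for illustration.

```python
import math

# Estimates copied from the Analysis of Maximum Likelihood Estimates table
B0, B_AGE, B_WOMEN = -6.0621, 0.0605, -0.3967

def prob_cvd(age, women):
    """Estimated probability of CVD; women is 1 for a woman, 0 for a man."""
    log_odds = B0 + B_AGE * age + B_WOMEN * women
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

p_man_60 = prob_cvd(60, 0)

# Years of age needed to offset the female coefficient
extra_years = -B_WOMEN / B_AGE

# A woman that many years older has the same log odds as the 60-year-old man
p_woman = prob_cvd(60 + extra_years, 1)

print(round(p_man_60, 4), round(extra_years, 1), round(p_woman, 4))
```

The two probabilities agree because the extra years of age add exactly as much to the log odds as being female subtracts.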
Getting Odds Ratio for Differences Other Than 1
PROC LOGISTIC DATA=temp DESCENDING;
  MODEL clinical = age women / CLODDS=WALD;
  UNITS age = 5 women = 1;
RUN;
SAS OUTPUT
Wald Confidence Interval for Adjusted Odds Ratios
Effect   Unit     Estimate   95% Confidence Limits
AGE      5.0000   1.353      1.087    1.685
women    1.0000   0.673      0.362    1.251
The AGE estimate is EXP (5*0.0605) = 1.353
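The odds ratio for a multi-unit difference and its Wald interval can be approximated by hand from the coefficient and its standard error. The sketch below assumes a normal multiplier of 1.96 and uses the rounded estimates from the earlier output, so the last digit may differ slightly from the SAS table. The function name is hypothetical.

```python
import math

def odds_ratio_for_units(beta, se, units, z=1.96):
    """Odds ratio and approximate 95% Wald limits for a `units` change in X."""
    point = math.exp(units * beta)
    lower = math.exp(units * (beta - z * se))
    upper = math.exp(units * (beta + z * se))
    return point, lower, upper

# Estimate and standard error taken from the AGE and women rows
age_or = odds_ratio_for_units(0.0605, 0.0223, units=5)
women_or = odds_ratio_for_units(-0.3967, 0.3166, units=1)

print([round(v, 3) for v in age_or])
print([round(v, 3) for v in women_or])
```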
Testing Differences Among Multiple Groups Using Logistic Regression
• Ho: π1 = π2 = … = πk
• Ha: πi not all equal
• Can test using logistic regression, since if the π's are equal then the log odds are equal
• Can code in SAS two ways
– Create dummy (design) variables to represent the groups
– Use a CLASS statement under PROC LOGISTIC
TOMHS Example: Is CVD Rate Equal in Four Clinical Centers?
• Ho: πA = πB = πC = πD
• SAS CODE in datastep (create own design variables):
DATA temp;
  SET tomhs.bpstudy;
  clinicA = 0; clinicB = 0; clinicC = 0; clinicD = 0;
  if clinic = 'A' then clinicA = 1;
  else if clinic = 'B' then clinicB = 1;
  else if clinic = 'C' then clinicC = 1;
  else if clinic = 'D' then clinicD = 1;
RUN;
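For readers who find the 0/1 coding easier to see outside SAS, here is the same logic as the data step in a small Python sketch. The function name `design_variables` is made up for this example.

```python
def design_variables(clinic, levels=("A", "B", "C", "D")):
    """Return one 0/1 indicator per clinic level, mirroring the data step."""
    return {f"clinic{level}": int(clinic == level) for level in levels}

# Each subject gets a 1 only in the indicator for their own clinic
for clinic in "ABCD":
    print(clinic, design_variables(clinic))
```

Only three of the four indicators go into a model; the omitted one becomes the reference group.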
Do Simple Analyses First
PROC MEANS N MEAN SUM MIN MAX DATA=temp;
  CLASS clinic;
  VAR clinical;
RUN;
Analysis Variable : CLINICAL Indicator - Clinical Endpoint
CLINIC   N Obs   N     Mean        Sum          Minimum   Maximum
------------------------------------------------------------------------------
A        195     195   0.0974359   19.0000000   0         1.0000000
B        251     251   0.0517928   13.0000000   0         1.0000000
C        296     296   0.0472973   14.0000000   0         1.0000000
D        160     160   0.0312500    5.0000000   0         1.0000000
The relative odds (A/D) should be about 3. All betas should be > 0
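The "about 3" can be checked directly from the event counts and sample sizes in the PROC MEANS table. The sketch below computes the crude odds in each clinic and the odds ratio against clinic D; the variable names are illustrative.

```python
import math

# Events (Sum) and sample sizes (N) from the PROC MEANS output
events = {"A": 19, "B": 13, "C": 14, "D": 5}
totals = {"A": 195, "B": 251, "C": 296, "D": 160}

def odds(clinic):
    """Crude odds of a clinical event: events / non-events."""
    e = events[clinic]
    return e / (totals[clinic] - e)

# Relative odds of each clinic versus the reference clinic D
crude_or = {c: odds(c) / odds("D") for c in "ABC"}

# Log odds ratios; these are positive because D has the lowest rate
log_or = {c: math.log(v) for c, v in crude_or.items()}

for c in "ABC":
    print(c, round(crude_or[c], 3), round(log_or[c], 4))
```

Because clinic D has the lowest event rate, every odds ratio against D exceeds 1 and every log odds ratio is positive.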
PROC LOGISTIC CODE
* Using class statement;
PROC LOGISTIC DATA=TEMP DESCENDING SIMPLE;
  CLASS clinic / PARAM=REF;
  MODEL clinical = clinic;
RUN;
* Using user defined design variables;
PROC LOGISTIC DATA=TEMP DESCENDING SIMPLE;
  MODEL clinical = clinica clinicb clinicc;
RUN;
PARAM=REF: uses 0/1 coding, with the last group as reference
SIMPLE: gives summary statistics
SAS OUTPUT USING CLASS STATEMENT
Response Profile
Ordered Value   CLINICAL   Total Frequency
1               1           51
2               0          851
Probability modeled is CLINICAL=1.
Class Level Information
                   Design Variables
Class    Value     1    2    3
CLINIC   A         1    0    0
         B         0    1    0
         C         0    0    1
         D         0    0    0
Same coding as in datastep
Clinic D reference
SAS OUTPUT USING CLASS STATEMENT
Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   7.9632       3    0.0468
Score              8.6122       3    0.0349
Wald               8.1300       3    0.0434
Type III Analysis of Effects
Effect   DF   Wald Chi-Square   Pr > ChiSq
CLINIC   3    8.1300            0.0434
The global Wald test and the Type III test for CLINIC are equal because no other variables are in the model
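The p-values in the global test table are upper-tail probabilities of a chi-square distribution with 3 degrees of freedom. For 3 degrees of freedom the tail area has a closed form, so it can be checked with the standard library alone. This is a sketch for this special case; for other degrees of freedom a statistics library would be the usual choice.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability P(X > x) for a chi-square with 3 df.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# Chi-square statistics from the Testing Global Null Hypothesis table
for name, stat in [("Likelihood Ratio", 7.9632), ("Score", 8.6122), ("Wald", 8.1300)]:
    print(name, round(chi2_sf_3df(stat), 4))
```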
SAS OUTPUT USING CLASS STATEMENT
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -3.4339    0.4544           57.1196           <.0001
CLINIC A    1     1.2080    0.5145            5.5114           0.0189
CLINIC B    1     0.5266    0.5363            0.9644           0.3261
CLINIC C    1     0.4311    0.5305            0.6604           0.4164
Odds Ratio Estimates
Effect          Point Estimate   95% Wald Confidence Limits
CLINIC A vs D   3.347            1.221    9.175
CLINIC B vs D   1.693            0.592    4.844
CLINIC C vs D   1.539            0.544    4.353
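One way to read these estimates: with clinic as the only predictor, the intercept is the log odds for reference clinic D, and adding a clinic's coefficient gives that clinic's log odds. Converting back to probabilities should therefore reproduce the observed CVD rates from PROC MEANS, up to rounding of the printed estimates. A sketch, with illustrative names:

```python
import math

intercept = -3.4339
clinic_coef = {"A": 1.2080, "B": 0.5266, "C": 0.4311, "D": 0.0}  # D is the reference

# Observed proportions (Mean column) from the PROC MEANS output
observed = {"A": 0.0974359, "B": 0.0517928, "C": 0.0472973, "D": 0.0312500}

def fitted_probability(clinic):
    """Probability implied by the fitted model for one clinic."""
    odds = math.exp(intercept + clinic_coef[clinic])
    return odds / (1.0 + odds)

fitted = {c: fitted_probability(c) for c in "ABCD"}
for c in "ABCD":
    print(c, round(fitted[c], 4), round(observed[c], 4))
```

The agreement holds because this model has one parameter per group, so it fits the four group rates exactly.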
SAS OUTPUT USING MODEL clinical = clinicA clinicB clinicC
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -3.4339    0.4544           57.1196           <.0001
clinicA     1     1.2080    0.5145            5.5114           0.0189
clinicB     1     0.5266    0.5363            0.9644           0.3261
clinicC     1     0.4311    0.5305            0.6604           0.4164
Odds Ratio Estimates
Effect    Point Estimate   95% Wald Confidence Limits
clinicA   3.347            1.221    9.175
clinicB   1.693            0.592    4.844
clinicC   1.539            0.544    4.353
Maybe clinic rates of CVD differ because age varies among centers
SAS OUTPUT USING MODEL clinical = clinic age
Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   16.5582      4    0.0024
Score              17.2001      4    0.0018
Wald               16.2760      4    0.0027
Type III Analysis of Effects
Effect   DF   Wald Chi-Square   Pr > ChiSq
CLINIC   3    8.9604            0.0298
AGE      1    8.4904            0.0036
Test if age and clinic are related to CVD
SAS OUTPUT USING MODEL clinical = clinic age
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -7.2250    1.4096           26.2725           <.0001
CLINIC A    1     1.3211    0.5189            6.4816           0.0109
CLINIC B    1     0.6448    0.5400            1.4256           0.2325
CLINIC C    1     0.5163    0.5335            0.9366           0.3332
AGE         1     0.0662    0.0227            8.4904           0.0036
Odds Ratio Estimates
Effect          Point Estimate   95% Wald Confidence Limits
CLINIC A vs D   3.747            1.355    10.361
CLINIC B vs D   1.906            0.661     5.492
CLINIC C vs D   1.676            0.589     4.768
AGE             1.068            1.022     1.117
Assumptions: Linear Versus Logistic Regression
Linear regression
• Y normally distributed
• Y linearly related to X
• σ² constant over X
• Each observation independent of other observations
• Large N not needed for tests if Y is normally distributed
Logistic regression
• Y binary
• Log odds linearly related to X
• N/A
• Each observation independent of other observations
• Large enough N to justify using χ²
Illustration of Linearity in Log Odds Assumption
Log odds = -6.2428 + 0.0613* Age
AGE ODDS
50 0.039
60 0.072
70 0.134
RO = 1.85 = .072/.039
RO = 1.85 = .134/.072
The relative odds for going from 50 to 60 years is the same as for going from 60 to 70 years
Note: Absolute risk is not linear with age
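Both points can be illustrated numerically from the fitted equation: the relative odds for any 10-year difference is exp(10 × 0.0613) regardless of starting age, while the change in probability is not constant. This is a sketch with made-up function names; odds computed from the rounded coefficients need not match the tabulated odds to the last digit.

```python
import math

B0, B_AGE = -6.2428, 0.0613  # coefficients from the fitted equation above

def odds_at(age):
    return math.exp(B0 + B_AGE * age)

def prob_at(age):
    o = odds_at(age)
    return o / (1.0 + o)

# Relative odds for a 10-year difference, from two different starting ages
ro_50_60 = odds_at(60) / odds_at(50)
ro_60_70 = odds_at(70) / odds_at(60)

# Absolute risk differences over the same two intervals
risk_diff_50_60 = prob_at(60) - prob_at(50)
risk_diff_60_70 = prob_at(70) - prob_at(60)

print(round(ro_50_60, 2), round(ro_60_70, 2))
print(round(risk_diff_50_60, 4), round(risk_diff_60_70, 4))
```

The two relative odds are identical, while the second risk difference is larger than the first, which is what "not linear with age" means here.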
Fitted regression line
Curve based on:
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$
β0 affects location
β1 affects curvature