
  • Introduction to logistic regression and Generalized Linear Models

    July 14, 2011

    Introduction to Statistical Measurement and Modeling

    Karen Bandeen-Roche, PhD, Department of Biostatistics, Johns Hopkins University

  • Data motivation: Osteoporosis data

    Scientific question: Can we detect osteoporosis earlier and more safely?

    Some related statistical questions:
    How does the risk of osteoporosis vary as a function of measures commonly used to screen for osteoporosis?
    Does age confound the relationship of screening measures with osteoporosis risk?
    Do ultrasound and DPA measurements discriminate osteoporosis risk independently of each other?

  • Outline

    Why we need to generalize linear models
    Generalized Linear Model specification
    Systematic and random model components
    Maximum likelihood estimation
    Logistic regression as a special case of GLM
    Systematic model / interpretation
    Inference
    Example

  • Regression for categorical outcomes: Why not just apply linear regression to categorical Ys?

    The linear model (A1) will often be unreasonable.

    The assumption of equal variances (A3) will nearly always be unreasonable.

    The assumption of normality will never be reasonable.

  • Introduction: Regression for binary outcomes

    Yi = 1{event occurs for sampling unit i} = 1 if the event occurs, 0 otherwise.
    pi = probability that the event occurs for sampling unit i = Pr{Yi = 1}.

    Begin by generalizing the random model (A5).
    Probability mass function: Bernoulli.
    Pr{Yi = 1} = pi; Pr{Yi = 0} = 1 - pi; all other yi occur with 0 probability.
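
    A minimal numeric check of this random model, in Python (not from the lecture; p and the sample size are illustrative): simulated Bernoulli draws should show mean ≈ p and variance ≈ p(1 - p).

      import numpy as np

      # Bernoulli random model: Pr{Y=1} = p, Pr{Y=0} = 1 - p.
      p = 0.3
      y = np.random.default_rng(0).binomial(n=1, p=p, size=100_000)
      print(y.mean())  # ~0.30 = E[Y] = p
      print(y.var())   # ~0.21 = Var(Y) = p*(1 - p)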

  • Binary regression

    By assuming Bernoulli, (A3) is definitely not reasonable:
    Var(Yi) = pi(1 - pi).
    The variance is not constant; rather, it is a function of the mean.

    Systematic model
    The goal remains to describe E[Yi|xi].
    The expectation of Bernoulli Yi is pi.
    To achieve a reasonable linear model (A1): describe some function of E[Yi|xi] as a linear function of covariates,
    g(E[Yi|xi]) = xi'β.
    Some common g: log, logit = log{p/(1-p)}, probit.
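
    A sketch of the common link functions g named above, using SciPy (the library choice is an assumption, not part of the lecture):

      import numpy as np
      from scipy.special import expit, logit
      from scipy.stats import norm

      # Three common links g applied to a mean p in (0, 1); values illustrative.
      p = np.array([0.1, 0.5, 0.9])
      print(np.log(p))        # log link
      print(logit(p))         # logit link: log{p/(1-p)}
      print(norm.ppf(p))      # probit link: inverse standard normal CDF
      print(expit(logit(p)))  # inverse logit recovers p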

  • General framework: Generalized Linear Models

    Random model
    Y ~ a density or mass function, fY, not necessarily normal.
    Technical aside: fY within the exponential family.

    Systematic model
    g(E[Yi|xi]) = xi'β = ηi
    g = link function; xi'β = linear predictor.

    Reference: Nelder JA, Wedderburn RWM. Generalized linear models. JRSSA 1972; 135:370-384.
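
    A minimal GLM fit under this specification, sketched with statsmodels on synthetic data (the variable names and coefficient values below are made up for illustration):

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(1)
      x = rng.normal(size=200)
      eta = -1.0 + 0.8 * x                         # linear predictor x'beta
      y = rng.binomial(1, 1 / (1 + np.exp(-eta)))  # Bernoulli random model
      X = sm.add_constant(x)                       # design matrix with intercept
      # Binomial family with its default (logit) link g:
      fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
      print(fit.params)  # estimates of beta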

  • Types of Generalized Linear Models

    Model (link function)  | Response                | Distribution    | Regression coef. interpretation
    ---------------------- | ----------------------- | --------------- | --------------------------------------
    Linear (identity)      | Continuous              | Gaussian        | Change in ave(Y) per unit change in X
    Logistic (logit)       | Binary                  | Binomial        | Log odds ratio
    Log-linear (log)       | Times to events/counts  | Poisson         | Log relative rate
    Proportional hazards   | Times to events         | Semi-parametric | Log hazard ratio

  • Estimation

    General method: maximum likelihood (Fisher).
    Given {Y1,...,Yn} distributed with joint density or mass function fY(y;θ), a likelihood function L(θ;y) is any function (of θ) that is proportional to fY(y;θ).

    If sampling is random, {Y1,...,Yn} are statistically independent, and L(θ;y) is proportional to the product of the individual fYi(yi;θ).

    For a GLM (with scale parameter a), estimation: β̂ maximizes L(β,a;y,X).

  • Maximum likelihood

    The maximum likelihood estimate (MLE), θ̂, maximizes L(θ;y).

    Under broad assumptions, MLEs are asymptotically
    unbiased (consistent) and
    efficient (most precise / lowest variance).
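
    A sketch of maximum likelihood in the simplest Bernoulli case (data made up): with independent sampling the log likelihood is the sum of individual log mass functions, and the numeric maximizer should match the closed-form MLE, the sample mean.

      import numpy as np
      from scipy.optimize import minimize_scalar

      y = np.array([1, 0, 0, 1, 1, 1, 0, 1])  # illustrative binary sample

      def neg_loglik(p):
          # -log L(p; y) for independent Bernoulli observations
          return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

      res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
      print(res.x, y.mean())  # numeric MLE ~ 0.625 = the sample mean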

  • Logistic regression

    Yi binary with pi = Pr{Yi = 1}.
    Example: Yi = 1{person i diagnosed with heart disease}.

    Simple logistic regression (1 covariate)
    Random model: Bernoulli / binomial.
    Systematic model: log{pi/(1 - pi)} = β0 + β1xi
    (the log odds; logit(pi)).

    Parameter interpretation
    β0 = log(heart disease odds) in the subpopulation with x = 0.
    β1 = log{px+1/(1 - px+1)} - log{px/(1 - px)}.
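
    A simple-logistic-regression sketch (a synthetic stand-in for the heart disease example; the covariate name and coefficient values are hypothetical):

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(2)
      age = rng.uniform(40, 80, size=500)
      eta = -6.0 + 0.08 * age                           # beta0 + beta1 * age
      disease = rng.binomial(1, 1 / (1 + np.exp(-eta)))
      fit = sm.Logit(disease, sm.add_constant(age)).fit(disp=False)
      print(fit.params)  # [beta0_hat, beta1_hat] on the log-odds scale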

  • Logistic regression: Interpretation notes

    β1 = log{px+1/(1 - px+1)} - log{px/(1 - px)}

    = log of the ratio [px+1/(1 - px+1)] / [px/(1 - px)], so that

    exp(β1) = [px+1/(1 - px+1)] / [px/(1 - px)]

    = odds ratio for the association of prevalent heart disease with each (say) one-year increment in age
    = factor by which the odds of heart disease increase / decrease with each 1-year cohort of age.
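
    Numerically, with an illustrative β1 (not an estimate from real data):

      import numpy as np

      beta1 = 0.08               # hypothetical log odds ratio per year of age
      print(np.exp(beta1))       # ~1.083: odds rise ~8.3% per 1-year increment
      print(np.exp(10 * beta1))  # ~2.23: per 10-year increment, exp(10*beta1)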

  • Multiple logistic regression

    Systematic model: log{pi/(1 - pi)} = β0 + β1xi1 + ... + βpxip

    Parameter interpretation
    β0 = log(heart disease odds) in the subpopulation with all x = 0.
    βj = difference in log outcome odds comparing subpopulations that differ by 1 on xj and whose values on all other covariates are the same:
    "adjusting for," "controlling for" the other covariates.

    One can define variables contrasting outcome odds between groups, nonlinear relationships, interactions, etc., just as in linear regression.
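
    A multiple-logistic sketch of the "adjusting for" reading: each exponentiated coefficient is an odds ratio holding the other covariates fixed (the covariates and coefficients below are hypothetical):

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(3)
      n = 500
      age = rng.uniform(40, 80, size=n)
      female = rng.binomial(1, 0.5, size=n)
      eta = -6.0 + 0.08 * age + 0.5 * female
      y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
      X = sm.add_constant(np.column_stack([age, female]))
      fit = sm.Logit(y, X).fit(disp=False)
      print(np.exp(fit.params[1:]))  # adjusted odds ratios for age and sex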

  • Logistic regression: prediction

    Translation from ηi to pi:
    log{pi/(1 - pi)} = β0 + β1xi1 + ... + βpxip = ηi

    Then pi = exp(ηi) / (1 + exp(ηi)) = the logistic function of ηi.

    A graph of pi versus ηi has a sigmoid shape.
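
    The translation, sketched directly (grid values are illustrative; note the sigmoid shape as eta moves from negative to positive):

      import numpy as np

      def inv_logit(eta):
          # p = exp(eta) / (1 + exp(eta)), written in a numerically stabler form
          return 1.0 / (1.0 + np.exp(-eta))

      for eta in (-4, -2, 0, 2, 4):
          print(eta, round(inv_logit(eta), 3))  # 0.018, 0.119, 0.5, 0.881, 0.982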

  • GLMs: Inference

    The negative inverse Hessian matrix of the log likelihood function characterizes Var(β̂) (adjunct).
    SE(β̂j) is obtained as the square root of the jth diagonal entry, typically after substituting β̂ for β.

    Wald inference applies the paradigm from Lecture 2:
    Z = (β̂j - β0j) / SE(β̂j) is asymptotically ~ N(0,1) under H0: βj = β0j.
    Z provides a test statistic for H0: βj = β0j versus HA: βj ≠ β0j.
    β̂j ± z(1-α/2) SE(β̂j) = (L, U) is a (1-α)×100% CI for βj, and
    {exp(L), exp(U)} is a (1-α)×100% CI for exp(βj).
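
    A Wald-inference sketch from hypothetical fitted values (the estimate and SE below are made up for illustration):

      import numpy as np
      from scipy.stats import norm

      beta_hat, se = 0.08, 0.02                  # illustrative beta_j hat and SE
      z = beta_hat / se                          # test of H0: beta_j = 0
      p_value = 2 * (1 - norm.cdf(abs(z)))
      lo, hi = beta_hat - 1.96 * se, beta_hat + 1.96 * se
      print(z, p_value)                          # z = 4.0, p ~ 6.3e-05
      print((lo, hi), (np.exp(lo), np.exp(hi)))  # 95% CIs for beta_j, exp(beta_j)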

  • GLMs: Global Inference

    Analog: F-testing in linear regression. The only difference: log likelihoods replace sums of squares.

    The hypothesis to be tested is H0: βj1 = ... = βjk = 0.
    Fit the model excluding xj1,...,xjk; save its -2 log likelihood = LS.
    Fit the full (or larger) model adding xj1,...,xjk back to the smaller model; save its -2 log likelihood = LL.
    Test statistic: S = LS - LL.
    Distribution under the null hypothesis: χ² with k degrees of freedom.
    Define the rejection region based on this distribution, compute S, and reject or not according to whether S falls in the rejection region.
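
    The same recipe as a numeric sketch (the two -2 log likelihood values are made up for illustration):

      from scipy.stats import chi2

      # S = (-2 log L_small) - (-2 log L_large), chi-square with df = k,
      # where k = number of covariates dropped from the larger model.
      m2ll_small, m2ll_large, k = 520.4, 512.1, 3
      S = m2ll_small - m2ll_large
      p_value = chi2.sf(S, k)
      print(S, p_value)  # S = 8.3 on 3 df, p ~ 0.040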

  • GLMs: Global Inference

    Many programs refer to deviance rather than -2 log likelihood. This quantity equals the difference in -2 log likelihoods between one's fitted model and a saturated model; deviance measures fit.
    Differences in deviances can be substituted for differences in -2 log likelihood in the method given on the previous page.
    Likelihood ratio tests have appealing optimality properties.

  • Outline: A few more topics

    Model checking: residuals, influence points.
    ML can be written as an iteratively reweighted least squares algorithm.
    Predictive accuracy.
    The framework generalizes easily.

  • Main Points

    Generalized linear modeling provides a flexible regression framework for a variety of response types:
    continuous and categorical measurement scales;
    probability distributions tailored to the outcome;
    a systematic model to accommodate the measurement range and interpretation.

    Logistic regression:
    binary responses (yes, no);
    Bernoulli / binomial distribution;
    regression coefficients as log odds ratios for the association between predictors and outcomes.

  • Main Points

    Generalized linear modeling accommodates description, inference, and adjustment with the same flexibility as linear modeling.

    Inference:
    Wald statistical tests and confidence intervals via parameter estimator standardization;
    likelihood ratio / global inference via comparison of log likelihoods from nested models.

  • Speaker notes

    (Note: a is a scale parameter in L(β,a;y,X).)

    Parameter interpretation: categorical xi (male/female example)
    e.g., xi = 1{person i is female}.
    Males: log{pi/(1 - pi)} = β0; females: log{pi/(1 - pi)} = β0 + β1.

    β0 = log{p0/(1 - p0)} = log odds of heart disease among men in the population under study, where p0 = Pr{Yi = 1 | xi = 0}.

    β1 = log{p1/(1 - p1)} - log{p0/(1 - p0)} = log[ {p1/(1 - p1)} / {p0/(1 - p0)} ]
    = log ratio of heart disease odds in women to heart disease odds in men; equivalently, exp(β1) is the factor by which the odds change. The same logic extends to continuous x.