
HST 190: Introduction to Biostatistics

Lecture 7: Logistic regression


Logistic regression

• We’ve previously discussed linear regression methods for predicting continuous outcomes

§ Functionally, predicting the mean at particular covariate levels

• What if we want to predict values for a dichotomous categorical variable, instead of a continuous one?

§ This corresponds to predicting the probability of the outcome variable being a “1” versus “0”

• Can we just use linear regression for a 0-1 outcome variable?



• Consider modeling the probability that a person receives a physical exam in a given year as a function of income.

§ A sample of individuals is collected. Each individual reports income and whether he/she went to the doctor last year.

patient #   Y = checkup   X = income
1           1             32,000
2           0             28,000
3           0             41,000
4           1             38,000
etc.


• Plotting this data and fitting a linear regression line, we see that the linear model is not tailored to this type of outcome

§ For example, an income of $500,000 yields a predicted probability of visiting the doctor greater than 1!
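The problem with the straight-line fit can be checked numerically. The sketch below fits an ordinary least-squares line to only the four example rows shown in the table (the full data set is not given, so the numbers are illustrative) and evaluates it at an income of 500,000. The helper name `fit_line` is ours.

```python
# Least-squares line through the four example rows (income, checkup).
incomes = [32_000, 28_000, 41_000, 38_000]
checkup = [1, 0, 0, 1]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(incomes, checkup)

# The fitted "probability" at an income of 500,000 is not a valid probability.
predicted = intercept + slope * 500_000
print(f"fitted value at income 500,000: {predicted:.2f}")
```

With these four rows the fitted value is well above 1, which is exactly the failure the slide describes: a straight line is unbounded, while a probability must stay between 0 and 1.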

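Logit transformation

• To overcome the problem, we define the logit transformation: if $0 < p < 1$, then $\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$

§ Notice that as $p \uparrow 1$, $\operatorname{logit}(p) \uparrow \infty$, and as $p \downarrow 0$, $\operatorname{logit}(p) \downarrow -\infty$

• Thus, $\operatorname{logit}(p)$ can take any real value, so we will fit a linear model on this transformed outcome instead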

• Write this type of model generally as $g(E[y]) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$

§ Where $E[y] = 1 \cdot P(y = 1) + 0 \cdot P(y = 0) = P(y = 1) = p$ and $g(E[y]) = \operatorname{logit}(E[y])$

§ This model is called a logistic regression model or a logit model

§ By comparison, the linear regression model takes $g(E[y]) = E[y]$
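As a quick numerical check of the limits described above, here is a minimal sketch of the logit transformation. The function name `logit` and the chosen probabilities are ours.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p)), defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

# The logit is 0 at p = 0.5, symmetric around it, and unbounded near 0 and 1.
for p in (0.001, 0.1, 0.5, 0.9, 0.999):
    print(f"p = {p:<6} logit(p) = {logit(p):+.3f}")
```

Because the transformed outcome ranges over the whole real line, a linear predictor can never produce an impossible value on the logit scale.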


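• A key benefit of fitting a logit model rather than contingency table methods is the ability to adjust for multiple covariates (including continuous covariates) simultaneously

• To interpret parameters, compare the fit for a man and a woman

§ $\operatorname{logit}(p_{\text{woman}}) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}} + \beta_{\text{woman}}$

§ $\operatorname{logit}(p_{\text{man}}) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}}$

patient #   y = checkup   income   age   gender
1           1             32,000   60    F
2           0             28,000   53    M
3           0             41,000   45    M
4           1             38,000   40    F
etc.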


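$$
\begin{aligned}
&\Rightarrow\ \operatorname{logit}(p_{\text{man}}) = \operatorname{logit}(p_{\text{woman}}) - \beta_{\text{woman}} \\
&\iff \ln\frac{p_{\text{woman}}}{1 - p_{\text{woman}}} - \ln\frac{p_{\text{man}}}{1 - p_{\text{man}}} = \beta_{\text{woman}} \\
&\iff \ln\frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \beta_{\text{woman}} \\
&\iff \frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \frac{\text{odds}_{\text{woman}}}{\text{odds}_{\text{man}}} = e^{\beta_{\text{woman}}}
\end{aligned}
$$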

• So, $\beta_{\text{woman}}$ is the log of the odds ratio for getting a checkup, comparing women to men, adjusting for age and income

• This result holds for any dichotomous variable in the model

§ This allows us to estimate the odds ratio relating a given exposure to disease in a regression, accounting for the effects of other variables
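The identity between the coefficient and the odds ratio can be verified numerically. The sketch below uses hypothetical coefficient values (chosen only for illustration, not estimates from any data) and checks that the woman-versus-man odds ratio equals exp(beta_woman) at any fixed age and income.

```python
import math

# Hypothetical coefficients, for illustration only.
ALPHA = -2.0
BETA_AGE = 0.02
BETA_INCOME = 0.00001
BETA_WOMAN = 0.7

def prob_checkup(age, income, woman):
    """Probability of a checkup under the logit model; woman is 1 or 0."""
    linear = ALPHA + BETA_AGE * age + BETA_INCOME * income + BETA_WOMAN * woman
    return 1.0 / (1.0 + math.exp(-linear))

def odds(p):
    return p / (1.0 - p)

def odds_ratio_woman_vs_man(age, income):
    """Odds ratio comparing a woman and a man with the same age and income."""
    return odds(prob_checkup(age, income, 1)) / odds(prob_checkup(age, income, 0))

for age, income in [(40, 38_000), (60, 32_000)]:
    print(age, income, round(odds_ratio_woman_vs_man(age, income), 6))
print("exp(beta_woman) =", round(math.exp(BETA_WOMAN), 6))
```

The individual probabilities change with age and income, but the odds ratio does not. That is what "adjusting for" the other covariates means in this model.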


• In a logistic regression $\operatorname{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$, denote the fitted parameter estimates as $\hat{\alpha}, \hat{\beta}_1, \dots, \hat{\beta}_k$

• If $X_j$ is a dichotomous exposure, then the estimated odds ratio relating this exposure to the outcome is

$$
\widehat{OR} = e^{\hat{\beta}_j}
$$

• If instead $X_j$ is a continuous exposure, then this odds ratio and its CI describe the outcome’s association with a one-unit increase in the exposure, adjusting for other covariates

§ e.g., “a one unit increase in age is associated on average with an $e^{\hat{\beta}_{\text{age}}}$-fold change in the odds of getting a checkup, holding gender constant.”

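Hypothesis testing and confidence intervals

• For an estimated coefficient $\hat{\beta}_j$ in a logistic model, the corresponding $100(1 - \alpha)\%$ CI for the odds ratio is given by

$$
\left( e^{\hat{\beta}_j - z_{1-\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_j)},\ e^{\hat{\beta}_j + z_{1-\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_j)} \right)
$$

§ Here $\alpha$ is the significance level, not the model intercept

§ MATLAB or other software will provide both $\hat{\beta}_j$ and $\widehat{\mathrm{se}}(\hat{\beta}_j)$

§ Take note of whether you are given $\hat{\beta}_j$ or $\widehat{OR} = e^{\hat{\beta}_j}$ in software output! This differs between programs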

• Testing the hypothesis $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$ is a z-test that is typically provided as part of software output

§ If the null is true, $Z = \hat{\beta}_j / \widehat{\mathrm{se}}(\hat{\beta}_j)$ is approximately $N(0,1)$
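The interval and the z-test above are simple to compute once software reports an estimate and its standard error. The following sketch uses Python's standard library. The function name `wald_summary` and the input values (an estimate of 0.7 with standard error 0.25) are hypothetical.

```python
import math
from statistics import NormalDist

def wald_summary(beta_hat, se, level=0.95):
    """Odds ratio, its Wald CI, the z statistic, and the two-sided p-value."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{1 - alpha/2}
    lower = math.exp(beta_hat - z_crit * se)
    upper = math.exp(beta_hat + z_crit * se)
    z = beta_hat / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"or": math.exp(beta_hat), "ci": (lower, upper), "z": z, "p_value": p_value}

# Hypothetical software output for one coefficient.
result = wald_summary(beta_hat=0.7, se=0.25)
print(result)
```

Note that the interval is built on the log-odds scale and then exponentiated, so it is not symmetric around the estimated odds ratio. A CI for the odds ratio that excludes 1 corresponds to rejecting the null hypothesis at the matching level.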

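Interaction terms

• Like in linear regression, we can also incorporate interaction terms in a logistic regression model

$$
\operatorname{logit}(p) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}} + \beta_{\text{woman}} X_{\text{woman}} + \beta_{\text{age:woman}} X_{\text{age}} X_{\text{woman}}
$$

• $\beta_{\text{age:woman}}$ captures the presence of an interaction effect, or effect modification, of gender by age

§ e.g., gender effect on probability of getting annual checkup is greater among younger people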
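To see how the interaction term changes the interpretation, the short derivation below compares a woman and a man of the same income and the same age, written $a$ here as our own shorthand. It follows directly from the interaction model by subtracting the two linear predictors.

```latex
\begin{aligned}
\operatorname{logit}(p_{\text{woman}}) - \operatorname{logit}(p_{\text{man}})
  &= \beta_{\text{woman}} + \beta_{\text{age:woman}}\, a \\
\Rightarrow\quad \frac{\text{odds}_{\text{woman}}}{\text{odds}_{\text{man}}}
  &= e^{\beta_{\text{woman}} + \beta_{\text{age:woman}}\, a}
\end{aligned}
```

Under this model the gender odds ratio is no longer a single number but a function of age, and $\beta_{\text{woman}}$ alone is the log odds ratio only at age 0. If, for instance, $\beta_{\text{woman}} > 0$ and $\beta_{\text{age:woman}} < 0$, the gender odds ratio is larger at younger ages.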

Model building for inference

• The techniques for variable selection in logistic regression are similar to those for linear regression

§ Biggest challenge is the lack of comparable visual fit diagnostics like residual plots

• When model building for studies of association between exposure and outcome, the focus is on including sources of confounding (i.e., external variables associated with both exposure and outcome)

• One strategy is to fit and report the following three models:

1) an unadjusted or minimally adjusted model

2) a model that includes ‘core’ confounders (‘primary’ model)

o clear indication from scientific knowledge and/or the literature

o consensus among investigators

3) a model that includes ‘core’ confounders plus any ‘potential’ confounders

o indication is less certain

Logistic regression in retrospective setting

• How do we interpret the intercept in $\operatorname{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$?

§ $\alpha = \log\left(\frac{p}{1-p}\right)$ is the log odds of experiencing the outcome in the population among subjects with $x_1 = \cdots = x_k = 0$

§ Links model to absolute prevalence of the outcome in the population

• What happens to the logit model if our sampling is case-control (or retrospective)?

§ That is, what if we sample based on outcome status?

§ e.g., sample 100 patients with a disease and 100 patients without

• Typically this setting artificially selects more cases than would arise naturally under cross-sectional or prospective sampling

§ so we cannot readily use the sample to describe the true probability of disease in the population


• Thus, we see that the intercept $\alpha$ is no longer meaningful in a logistic regression using case-control sampled data

§ What about the other estimates?

• Recall we showed that using contingency tables to compute odds ratios was valid both in prospective and retrospective sampling designs

• It turns out that the same is true for the estimated coefficients in logistic regression! Just as before,

$$
\widehat{OR} = e^{\hat{\beta}_j}
$$

§ all other inference (tests, CIs) is also the same

• The estimated odds ratio of the outcome between exposed and unexposed groups is the same even if the ‘absolute’ proportion of cases sampled is higher
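A small 2x2 example makes the invariance concrete. The sketch below starts from hypothetical population counts, forms the expected case-control table when all cases but only 0.5% of non-cases are sampled, and compares the odds ratio and the baseline log odds (the intercept in a model with one binary exposure). All counts, sampling fractions, and function names are our own illustrative choices.

```python
import math

# Hypothetical population counts (exposure x outcome), for illustration only.
population = {
    ("exposed", "case"): 300,
    ("exposed", "noncase"): 9_700,
    ("unexposed", "case"): 200,
    ("unexposed", "noncase"): 89_800,
}

def odds_ratio(table):
    """Cross-product odds ratio of a 2x2 exposure-by-outcome table."""
    return (table[("exposed", "case")] * table[("unexposed", "noncase")]) / (
        table[("exposed", "noncase")] * table[("unexposed", "case")]
    )

def baseline_log_odds(table):
    """Log odds of the outcome among the unexposed (the intercept alpha)."""
    return math.log(table[("unexposed", "case")] / table[("unexposed", "noncase")])

def case_control_sample(table, case_fraction, noncase_fraction):
    """Expected counts when cases and non-cases are sampled at different rates."""
    fraction = {"case": case_fraction, "noncase": noncase_fraction}
    return {cell: count * fraction[cell[1]] for cell, count in table.items()}

sample = case_control_sample(population, case_fraction=1.0, noncase_fraction=0.005)
shift = baseline_log_odds(sample) - baseline_log_odds(population)

print("odds ratio, population:  ", round(odds_ratio(population), 3))
print("odds ratio, case-control:", round(odds_ratio(sample), 3))
print("intercept shift:", round(shift, 3), "= ln(1.0 / 0.005) =", round(math.log(200), 3))
```

The sampling fractions cancel in the cross-product ratio, so the odds ratio is unchanged, while the intercept is shifted by the log of the ratio of sampling fractions. Since that ratio is usually unknown, the intercept cannot be interpreted.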

Matched case-control designs

• To further increase the statistical efficiency of a study, researchers may create a matched case-control design

§ For every case sampled, one or more controls are selected based on similarity to the case

o Matching each case with $q$ controls is called $1{:}q$ matching

§ Goal is to correct for potential confounding in the study design

§ e.g., match each case with a non-case of the same age and gender, resulting in two groups with the same distributions of age and gender

• As with standard case-control, analysis then measures association between an exposure of interest and the outcome

§ Exposure of interest is not a factor used for matching

• Matched designs trade the increased cost of matching each subject for higher power and greater potential for causal inference

Analyzing matched case-control designs

• Suppose the sample includes $m$ matched sets. How should we approach analysis?

• Naïve approach: choose one matched set to be ‘baseline,’ and include $m - 1$ indicator variables, one for each other set

§ Essentially, treat each matched set as a level of a categorical variable

• Such a model forces us to estimate the effect of exposure within groups that may only have a few people in them

§ Unstable estimation

§ Cannot generalize estimated comparisons of specific pairs of people

• Instead, we want an analysis that estimates exposure effect by aggregating across matched sets

Conditional logistic regression

• Instead, researchers use conditional logistic regression to estimate the effect of an exposure of interest, conditioning out the factors used to create the matched sets

• To illustrate, assume a matched pairs design. Let

§ $y_{i1} = 1,\ y_{i2} = 0$ be the disease indicators of the $i$th case-control pair

§ $x_{i11}, \dots, x_{i1k},\ x_{i21}, \dots, x_{i2k}$ be the covariates of the $i$th pair

o Does not include ‘matched on’ factors, which are accounted for in design

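• Then for each pair, define the conditional likelihood contribution

$$
L_i(\beta_1, \dots, \beta_k)
= \frac{P(y_{i1} = 1 \cap y_{i2} = 0)}{P(y_{i1} = 1 \cap y_{i2} = 0) + P(y_{i1} = 0 \cap y_{i2} = 1)}
= \frac{e^{\sum_{j=1}^{k} \beta_j x_{i1j}}}{e^{\sum_{j=1}^{k} \beta_j x_{i1j}} + e^{\sum_{j=1}^{k} \beta_j x_{i2j}}}
$$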


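• Thus, we can compute estimates that maximize the conditional likelihood, i.e., the product of the contributions from the $m$ matched pairs (equivalently, the sum of their logs):

$$
\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \prod_{i=1}^{m} L_i(\beta_1, \dots, \beta_k)
$$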

• If $\beta_j$ is the coefficient of the exposure of interest, then as before $\widehat{OR} = e^{\hat{\beta}_j}$

§ Standard methods for testing and CIs are all the same as before

• Note that because we already adjusted for factors used for matching, we do not get estimated effects for these factors

§ It would be inappropriate to include matching factors as covariates

• We also do not get an estimated intercept, which makes sense because the intercept is not interpretable in the case-control setting anyway
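The conditional likelihood for matched pairs can be maximized directly in a few lines. The sketch below assumes 1:1 matching with a single binary exposure and hypothetical pair counts. It uses a simple golden-section search (our choice of optimizer, not a method prescribed in the lecture) on the conditional log-likelihood, which is concave in the coefficient.

```python
import math

def conditional_log_likelihood(beta, pairs):
    """Sum of log L_i over 1:1 matched pairs with one covariate.

    Each pair is (x_case, x_control); no intercept and no matching factors appear.
    """
    total = 0.0
    for x_case, x_control in pairs:
        total += beta * x_case - math.log(
            math.exp(beta * x_case) + math.exp(beta * x_control)
        )
    return total

def maximize(f, lo=-10.0, hi=10.0, tol=1e-8):
    """Golden-section search for the maximizer of a concave function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            a = c   # the maximum lies to the right of c
        else:
            b = d   # the maximum lies to the left of d
    return (a + b) / 2

# Hypothetical pairs: 30 with only the case exposed, 10 with only the control
# exposed, and 60 concordant pairs (both exposed or both unexposed).
pairs = [(1, 0)] * 30 + [(0, 1)] * 10 + [(1, 1)] * 25 + [(0, 0)] * 35

beta_hat = maximize(lambda b: conditional_log_likelihood(b, pairs))
print("beta_hat:", round(beta_hat, 4), " odds ratio:", round(math.exp(beta_hat), 3))
```

Concordant pairs contribute a constant factor of 1/2 each and so carry no information about the coefficient. In this binary-exposure setting, the estimated odds ratio reduces to the ratio of the two discordant pair counts, here 30/10 = 3.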

Logistic regression modeling for prediction

• Using a fitted logistic regression model, we so far have focused on estimation and inference of associations between the covariates and the outcome in the form of odds ratios

• We can also predict individual probabilities of experiencing the outcome using the fitted model:

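• From our model $\operatorname{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$, we can rearrange to get a predicted probability $\hat{p}$. Writing $\lambda$ for the linear predictor $\alpha + \beta_1 x_1 + \cdots + \beta_k x_k$:

$$
\ln\frac{p}{1 - p} = \lambda
\iff \frac{p}{1 - p} = e^{\lambda}
\iff p = e^{\lambda} - p\,e^{\lambda}
\iff p\,(1 + e^{\lambda}) = e^{\lambda}
\iff p = \frac{e^{\lambda}}{1 + e^{\lambda}}
$$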

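• Therefore, we see that our regression model leads to predicted probabilities

$$
\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k}}{1 + e^{\hat{\alpha} + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k}}
$$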
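The predicted-probability formula translates directly into code. In the sketch below the function name `predicted_probability` and the fitted coefficients (an intercept plus coefficients for age and income) are hypothetical values chosen for illustration.

```python
import math

def predicted_probability(alpha_hat, beta_hats, xs):
    """Inverse logit of the fitted linear predictor."""
    linear = alpha_hat + sum(b * x for b, x in zip(beta_hats, xs))
    # 1 / (1 + e^(-linear)) is algebraically equal to e^linear / (1 + e^linear).
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical fitted coefficients for (age, income) and one new individual.
alpha_hat = -2.0
beta_hats = [0.02, 0.00001]
p_hat = predicted_probability(alpha_hat, beta_hats, xs=[60, 32_000])
print("predicted probability of a checkup:", round(p_hat, 4))
```

Applying the logit to the result recovers the linear predictor, so the prediction always lies strictly between 0 and 1 regardless of the covariate values.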



• We may even want to predict individuals’ outcome status, using the model to predict whether or not they will experience the outcome

§ e.g., build a risk prediction model to predict who might develop a disease

• Just as in the linear regression case, prediction introduces important considerations of model selection and prediction validation

Variable selection for prediction

• Variable selection for prediction is also similar to the linear regression setting, and can use similar techniques:

1) Fixed set by design (treatment indicator + background variables)

2) Fit all possible subsets of models and find the one that fits the best according to some criterion:

§ AIC or BIC

§ Predictive performance by (cross-validated) AUC

3) Sequential: forward/backward/stepwise selection

4) Regularized/penalized regression method

Model selection criteria

• Logistic regression models are fit by maximum likelihood, so if two models have the same number of parameters, choose the one with a higher final likelihood value

§ Similar to linear regression, aim to balance final likelihood ($\hat{L}$) and number of parameters ($k$).

§ General form of any criterion: $f(\hat{L}) + g(k)$

• Same criteria used for linear regression are available here, where $n$ is the sample size:

§ Akaike’s Information Criterion: $\mathrm{AIC} = -2 \ln \hat{L} + 2k$

§ Bayesian Information Criterion: $\mathrm{BIC} = -2 \ln \hat{L} + k \ln n$
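Both criteria are one-line computations from the maximized log-likelihood. The sketch below compares two hypothetical fitted models, with `k` standing for the number of estimated parameters. The log-likelihood values and sample size are made up for illustration.

```python
import math

def aic(log_lik, k):
    """Akaike's Information Criterion from the maximized log-likelihood."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian Information Criterion; n is the sample size."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fits on the same n = 200 observations.
n = 200
models = {
    "smaller model": {"log_lik": -120.0, "k": 3},
    "larger model": {"log_lik": -118.5, "k": 5},
}

for name, m in models.items():
    print(name, "AIC =", aic(m["log_lik"], m["k"]),
          "BIC =", round(bic(m["log_lik"], m["k"], n), 2))
```

Lower values are better for both criteria. In this made-up comparison the larger model has the higher likelihood, but not by enough to offset the penalty for two extra parameters, so both criteria prefer the smaller model. BIC penalizes each parameter more heavily than AIC whenever ln(n) exceeds 2.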

Binary prediction validation

• To measure predictive performance for binary outcomes, one approach returns to our discussion of diagnostic testing

• Recall, in diagnostic testing it is important to balance

§ Correctly testing positive for true disease cases (‘sensitivity’)

§ Correctly testing negative for true non-cases (‘specificity’)

              Test positive            Test negative            Total
Disease       True Positive (TP)       False Negative (FN)      D+
              D+ ∩ T+                  D+ ∩ T−
No Disease    False Positive (FP)      True Negative (TN)       D−
              D− ∩ T+                  D− ∩ T−
Total         T+                       T−


• Binary prediction is nearly identical, where instead of ‘testing’ we are ‘predicting’ disease status

§ Want to correctly predict disease in true cases, and correctly predict no disease in true non-cases

• How do we convert predicted individual probabilities $\hat{p}$ into discrete ‘case’ or ‘non-case’ predictions?

§ We must choose an arbitrary cutoff value, e.g., “if $\hat{p}_i > 0.5$ then the $i$th individual is predicted to be a case”
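The effect of the cutoff can be seen on a toy example. The sketch below dichotomizes predicted probabilities at a cutoff and computes sensitivity and specificity against known outcomes. The ten outcomes and predicted probabilities are hypothetical, and the function names are ours.

```python
def classify(p_hat, cutoff=0.5):
    """Predict 'case' (1) when the predicted probability exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in p_hat]

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) from true and predicted 0/1 labels."""
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    tn = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 0)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical true outcomes and predicted probabilities for ten individuals.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.35, 0.2, 0.1, 0.05]

for cutoff in (0.25, 0.5):
    sens, spec = sensitivity_specificity(y_true, classify(p_hat, cutoff))
    print(f"cutoff {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Lowering the cutoff labels more people as cases, which raises sensitivity at the cost of specificity. No single cutoff is correct in general. The choice depends on the relative costs of the two kinds of error.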


• How to choose the cutoff value for the Vit. E level in serum?

• For a group with KNOWN disease status, let’s list some possible cutoff values. First, we’ll see how many disease vs. no disease patients fall on either side of each cutoff…

Predicted probability cutoff for predicted ‘case’   0.01   0.20   0.40   0.60   0.80   0.99

Proportion of patients with value ≥ cutoff
  Disease                                            0.95   0.87   0.73   0.54   0.34   0.17
  No Disease                                         0.91   0.68   0.38   0.12   0.02   0.002


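• Several important relationships here (note that both are conditional on true disease status):

§ $P(\hat{p}_i \geq \text{cutoff} \mid \text{patient } i \text{ is a case}) = \text{sensitivity}$

§ $P(\hat{p}_i \geq \text{cutoff} \mid \text{patient } i \text{ is a non-case}) = 1 - \text{specificity}$

• We can summarize the testing possibilities by plotting sensitivity vs. (1 − specificity)…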

Predicted probability cutoff for ‘case’   0.01   0.20   0.40   0.60   0.80   0.99

Sensitivity                                0.95   0.87   0.73   0.54   0.34   0.17
1 − Specificity                            0.91   0.68   0.38   0.12   0.02   0.002


• A receiver operating characteristic (ROC) curve for a test is a plot of sensitivity vs. (1 − specificity)

§ A test’s ROC curve helps us choose an optimal cutoff point. It also shows us how useful a test is overall.

• The Area Under the Curve (AUC) is a single number summarizing a test’s ability to discriminate between true cases and true non-cases.

§ AUC is the probability, for a randomly chosen case and non-case, that the case will have the higher predicted probability; an AUC of 0.5 is a ‘coin toss’
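The area under the ROC curve can be approximated from the tabulated points with the trapezoid rule. The sketch below uses the six (1 − specificity, sensitivity) pairs from the table and, as an assumption of ours, adds the end points (0, 0) and (1, 1), which correspond to predicting nobody and everybody as a case.

```python
# (1 - specificity, sensitivity) at cutoffs 0.01, 0.20, 0.40, 0.60, 0.80, 0.99.
roc_points = [
    (0.91, 0.95), (0.68, 0.87), (0.38, 0.73),
    (0.12, 0.54), (0.02, 0.34), (0.002, 0.17),
]

def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve through the given points."""
    # Assumed end points: (0, 0) and (1, 1) close the curve.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

auc = trapezoid_auc(roc_points)
print("approximate AUC:", round(auc, 3))
```

With only six cutoffs this is a coarse approximation to the full curve, but it lands at roughly 0.75, clearly better than the 0.5 of a coin toss.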

Cross-validation

• Cross-validation extends to the binary classification setting using prediction metrics like AUC

• The available dataset is divided into two (or three) random parts.

§ Training set is used to fit the model.

§ Test set is used to check the predictive capability (e.g., AUC) and refine the model. Go back to training if needed.

§ Optional: Validation set used once to estimate the model’s true AUC.

• If the dataset is small or you do not want to set aside data, you can still estimate AUC using $k$-fold cross-validation

• In $k$-fold cross-validation, the data is split into $k$ groups, and the model is repeatedly fit on all but one group, then its ability to predict the left-out group is recorded

§ average AUC over all $k$ groups estimates predictive performance on a ‘new’ dataset
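The procedure can be sketched end to end in plain Python. Everything in the block below is an illustrative assumption rather than the lecture's implementation: the data are simulated, the model is fit by simple gradient ascent on the log-likelihood (real software typically uses Newton-type methods), and AUC is computed from its pairwise definition.

```python
import math
import random

def expit(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, steps=500, lr=0.5):
    """Fit intercept and slopes by gradient ascent on the mean log-likelihood."""
    coef = [0.0] * (len(X[0]) + 1)           # coef[0] is the intercept
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(coef)
        for row, yi in zip(X, y):
            resid = yi - expit(coef[0] + sum(c * v for c, v in zip(coef[1:], row)))
            grad[0] += resid
            for j, v in enumerate(row, start=1):
                grad[j] += resid * v
        coef = [c + lr * g / n for c, g in zip(coef, grad)]
    return coef

def predict(coef, X):
    return [expit(coef[0] + sum(c * v for c, v in zip(coef[1:], row))) for row in X]

def auc(y, p_hat):
    """Probability that a random case outranks a random non-case (ties count 1/2)."""
    cases = [p for p, yi in zip(p_hat, y) if yi == 1]
    noncases = [p for p, yi in zip(p_hat, y) if yi == 0]
    if not cases or not noncases:
        raise ValueError("need at least one case and one non-case")
    wins = sum((pc > pn) + 0.5 * (pc == pn) for pc in cases for pn in noncases)
    return wins / (len(cases) * len(noncases))

def k_fold_auc(X, y, k=5, seed=0):
    """Fit on k - 1 folds, score AUC on the held-out fold, for each fold in turn."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        coef = fit_logistic([X[i] for i in train], [y[i] for i in train])
        p_hat = predict(coef, [X[i] for i in held_out])
        scores.append(auc([y[i] for i in held_out], p_hat))
    return scores

# Simulated data: one informative covariate and one pure-noise covariate.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if rng.random() < expit(-0.5 + 1.5 * x1) else 0 for x1, _ in X]

fold_aucs = k_fold_auc(X, y, k=5)
print("fold AUCs:", [round(a, 3) for a in fold_aucs])
print("cross-validated AUC:", round(sum(fold_aucs) / len(fold_aucs), 3))
```

Each observation is used for testing exactly once and never to score a model that was trained on it, which is what lets the averaged AUC stand in for performance on new data.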
