HST 190: Introduction to Biostatistics
Lecture 7: Logistic regression
1 HST190:IntrotoBiostatistics
Logistic regression
• We’ve previously discussed linear regression methods for predicting continuous outcomes
  § Functionally, predicting the mean at particular covariate levels
• What if we want to predict values for a dichotomous categorical variable, instead of a continuous one?
  § This corresponds to predicting the probability of the outcome variable being a “1” versus “0”
• Can we just use linear regression for a 0-1 outcome variable?
• Consider modeling the probability that a person receives a physical exam in a given year as a function of income.
  § A sample of individuals is collected. Each individual reports income and whether he/she went to the doctor last year.

patient #   Y = checkup   X = income
1           1             32,000
2           0             28,000
3           0             41,000
4           1             38,000
etc.
• Plotting this data and fitting a linear regression line, we see that the linear model is not tailored to this type of outcome
  § For example, an income of $500,000 yields a predicted probability of visiting the doctor greater than 1!
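The problem with a straight-line fit can be reproduced in a few lines. The sketch below fits ordinary least squares by hand to only the four patient rows visible in the table above, so the fitted numbers are illustrative rather than the lecture's actual fit; the helper names `fit_line` and `predict` are made up for this example.

```python
# Ordinary least squares with one covariate, written out by hand.
# Data: only the four rows shown in the lecture table (illustrative subset).
incomes = [32.0, 28.0, 41.0, 38.0]   # income in thousands of dollars
checkup = [1, 0, 0, 1]               # 1 = went to the doctor last year


def fit_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(incomes, checkup)


def predict(income_thousands):
    """Fitted 'probability' from the straight line; not confined to [0, 1]."""
    return intercept + slope * income_thousands


pred_500k = predict(500.0)
print(f"fitted value at an income of 500 thousand: {pred_500k:.2f}")
```

Nothing in the straight-line model keeps the fitted value inside [0, 1], which is what motivates transforming the outcome scale.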
Logit transformation
• To overcome the problem, we define the logit transformation: if $0 < p < 1$, then
$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$
  § Notice that as $p \uparrow 1$, $\text{logit}(p) \uparrow \infty$, and as $p \downarrow 0$, $\text{logit}(p) \downarrow -\infty$
• Thus, $\text{logit}(p)$ can take any real value, so we will fit a linear model on this transformed outcome instead
• Write this type of model generally as
$$g(E(y)) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$$
  § where $E(y) = 1 \cdot P(y=1) + 0 \cdot P(y=0) = P(y=1) = p$ and $g(E(y)) = \text{logit}(E(y))$
  § This model is called a logistic regression model or a logit model
  § By comparison, the linear regression model takes $g(E(y)) = E(y)$
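A minimal sketch of the logit transformation, showing how probabilities near 0 and 1 map to large negative and large positive values. The function name `logit` is chosen here for illustration.

```python
import math


def logit(p):
    """Log odds ln(p / (1 - p)); defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is defined only for 0 < p < 1")
    return math.log(p / (1.0 - p))


# The transformed value is unbounded in both directions and symmetric about p = 0.5.
for p in (0.000001, 0.2, 0.5, 0.8, 0.999999):
    print(f"p = {p:<9} logit(p) = {logit(p):8.3f}")
```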
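• A key benefit of fitting a logit model rather than using contingency table methods is the ability to adjust for multiple covariates (including continuous covariates) simultaneously

patient #   y = checkup   income   age   gender
1           1             32,000   60    F
2           0             28,000   53    M
3           0             41,000   45    M
4           1             38,000   40    F
etc.

• To interpret parameters, compare the fit for a man and a woman with the same age and income
  § $\text{logit}(p_{\text{woman}}) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}} + \beta_{\text{woman}}$
  § $\text{logit}(p_{\text{man}}) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}}$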
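$$\Rightarrow \text{logit}(p_{\text{man}}) = \text{logit}(p_{\text{woman}}) - \beta_{\text{woman}}$$
$$\iff \ln\frac{p_{\text{woman}}}{1 - p_{\text{woman}}} - \ln\frac{p_{\text{man}}}{1 - p_{\text{man}}} = \beta_{\text{woman}}$$
$$\iff \ln\frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \beta_{\text{woman}}$$
$$\iff \frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \frac{\text{odds}_{\text{woman}}}{\text{odds}_{\text{man}}} = e^{\beta_{\text{woman}}}$$
• So, $\beta_{\text{woman}}$ is the log of the odds ratio for getting a checkup comparing women to men, adjusting for age and income
• This result holds for any dichotomous variable in the model
  § This allows us to estimate the odds ratio for a given exposure with disease in a regression, accounting for the effects of other variables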
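The identity between the coefficient and the odds ratio can be checked numerically. The coefficients below are hypothetical values chosen only for illustration (they are not fitted to any data), and the function names are assumptions of this sketch.

```python
import math

# Hypothetical coefficients, for illustration only (not fitted values).
ALPHA, B_AGE, B_INCOME, B_WOMAN = -3.0, 0.03, 0.00002, 0.4


def prob_checkup(age, income, woman):
    """Probability implied by logit(p) = alpha + b_age*age + b_income*income + b_woman*woman."""
    lam = ALPHA + B_AGE * age + B_INCOME * income + B_WOMAN * woman
    return 1.0 / (1.0 + math.exp(-lam))


def odds(p):
    return p / (1.0 - p)


def gender_odds_ratio(age, income):
    """Odds of a checkup for a woman divided by the odds for a man, same age and income."""
    return odds(prob_checkup(age, income, 1)) / odds(prob_checkup(age, income, 0))


# The ratio does not depend on the age and income at which it is evaluated.
for age, income in [(40, 38000), (60, 32000), (53, 28000)]:
    print(age, income, round(gender_odds_ratio(age, income), 6))
print("exp(beta_woman) =", round(math.exp(B_WOMAN), 6))
```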
• In a logistic regression $\text{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$, denote the fitted parameter estimates as $\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_k$
• If $X_j$ is a dichotomous exposure, then the estimated odds ratio relating this exposure to the outcome is
$$\widehat{OR} = e^{\hat{\beta}_j}$$
• If instead $X_j$ is a continuous exposure, then the above odds ratio and CI describe the outcome’s association with a one-unit increase in the exposure, adjusting for other covariates
  § e.g., “a one unit increase in age is associated on average with an $e^{\hat{\beta}_{\text{age}}}$-fold change in the odds of getting a checkup, holding gender constant.”
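Hypothesis testing and confidence intervals
• For an estimated coefficient $\hat{\beta}_j$ in a logistic model, the corresponding $100(1-\alpha)\%$ CI for the odds ratio is given by
$$\left( e^{\hat{\beta}_j - z_{1-\alpha/2}\,\widehat{se}(\hat{\beta}_j)},\; e^{\hat{\beta}_j + z_{1-\alpha/2}\,\widehat{se}(\hat{\beta}_j)} \right)$$
  § Matlab or other software will provide both $\hat{\beta}_j$ and $\widehat{se}(\hat{\beta}_j)$
  § Take note of whether you are given $\hat{\beta}_j$ or $\widehat{OR} = e^{\hat{\beta}_j}$ in software output! This differs between programs
• Testing the hypothesis $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$ is a z-test that is typically provided as part of software output
  § If the null is true, $Z = \hat{\beta}_j / \widehat{se}(\hat{\beta}_j)$ is approximately $N(0,1)$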
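The interval and test above need only the coefficient estimate and its standard error. The following sketch uses Python's standard library; the inputs 0.5 and 0.2 are hypothetical numbers, and `wald_summary` is a name made up for this example.

```python
import math
from statistics import NormalDist


def wald_summary(beta_hat, se, level=0.95):
    """Odds ratio, CI on the odds-ratio scale, z statistic and two-sided p-value."""
    # z_{1 - alpha/2}: upper critical value of the standard normal distribution
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    z = beta_hat / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return {
        "odds_ratio": math.exp(beta_hat),
        "ci": (math.exp(beta_hat - z_crit * se), math.exp(beta_hat + z_crit * se)),
        "z": z,
        "p_value": p_value,
    }


result = wald_summary(beta_hat=0.5, se=0.2)  # hypothetical software output
print(result)
```

The interval is built on the log odds scale and then exponentiated, so it is not symmetric around the estimated odds ratio.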
Interaction terms
• Like in linear regression, we can also incorporate interaction terms in a logistic regression model
$$\text{logit}(p) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}} + \beta_{\text{woman}} X_{\text{woman}} + \beta_{\text{age:woman}} X_{\text{age}} X_{\text{woman}}$$
• $\beta_{\text{age:woman}}$ captures the presence of an interaction effect or effect modification of gender by age
  § e.g., gender effect on probability of getting annual checkup is greater among younger people
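With an interaction term, the gender odds ratio is no longer a single number: subtracting the man's logit from the woman's at the same age and income leaves $\beta_{\text{woman}} + \beta_{\text{age:woman}} X_{\text{age}}$. A small sketch, using hypothetical coefficients picked so that the effect shrinks with age:

```python
import math

# Hypothetical coefficients, for illustration only.
B_WOMAN = 1.2
B_AGE_WOMAN = -0.015   # negative sign: the gender effect weakens as age increases


def gender_odds_ratio(age):
    """Odds ratio (women vs. men) at a given age, other covariates held equal."""
    # logit(p_woman) - logit(p_man) = beta_woman + beta_age:woman * age
    return math.exp(B_WOMAN + B_AGE_WOMAN * age)


for age in (30, 50, 70):
    print(age, round(gender_odds_ratio(age), 3))
```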
Model building for inference
• The techniques for variable selection in logistic regression are similar to those for linear regression
  § Biggest challenge is the lack of comparable visual fit diagnostics like residual plots
• When model building for studies of association between exposure and outcome, the focus is on including sources of confounding (i.e., external variables associated with both exposure and outcome)
• One strategy is to fit and report the following three models:
  1) an unadjusted or minimally adjusted model
  2) a model that includes ‘core’ confounders (‘primary’ model)
     o clear indication from scientific knowledge and/or the literature
     o consensus among investigators
  3) a model that includes ‘core’ confounders plus any ‘potential’ confounders
     o indication is less certain
Logistic regression in retrospective setting
• How do we interpret the intercept in $\text{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$?
  § $\alpha = \log\frac{p}{1-p}$ is the log odds of experiencing the outcome in the population among subjects with $x_1 = \cdots = x_k = 0$
  § Links the model to the absolute prevalence of the outcome in the population
• What happens to the logit model if our sampling is case-control (or retrospective)?
  § That is, what if we sample based on outcome status?
  § e.g., sample 100 patients with a disease and 100 patients without
• Typically this setting artificially selects more cases than would arise naturally under cross-sectional or prospective sampling
  § so we cannot readily use the sample to describe the true probability of disease in the population
• Thus, we see that the intercept $\alpha$ is no longer meaningful in a logistic regression using case-control sampled data
  § What about the other estimates?
• Recall we showed that using contingency tables to compute odds ratios was valid in both prospective and retrospective sampling designs
• It turns out that the same is true for the estimated coefficients in logistic regression! Just as before,
$$\widehat{OR} = e^{\hat{\beta}_j}$$
  § all other inference (tests, CIs) is also the same
• The estimated odds ratio of the outcome between exposed and unexposed groups is the same even if the ‘absolute’ proportion of cases sampled is higher
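One way to see why only the intercept is affected: if cases are sampled with probability $\pi_1$ and noncases with probability $\pi_0$, Bayes' rule multiplies the odds at every covariate level by $\pi_1/\pi_0$, which shifts the log odds by the constant $\ln(\pi_1/\pi_0)$. The sketch below checks this for one covariate; the population coefficients and sampling fractions are hypothetical.

```python
import math

# Hypothetical population model and sampling fractions, for illustration only.
ALPHA, BETA = -4.0, 0.7            # population intercept and log odds ratio
PI_CASE, PI_CONTROL = 0.9, 0.01    # probability that a case / noncase is sampled


def logit(p):
    return math.log(p / (1.0 - p))


def population_prob(x):
    return 1.0 / (1.0 + math.exp(-(ALPHA + BETA * x)))


def sample_prob(x):
    """P(case | x) among sampled subjects, by Bayes' rule."""
    p = population_prob(x)
    return PI_CASE * p / (PI_CASE * p + PI_CONTROL * (1.0 - p))


# Read the intercept and slope off the log odds in the case-control sample.
sample_intercept = logit(sample_prob(0.0))
sample_slope = logit(sample_prob(1.0)) - logit(sample_prob(0.0))

print("slope in sample:    ", round(sample_slope, 6), "(population:", BETA, ")")
print("intercept in sample:", round(sample_intercept, 6), "(population:", ALPHA, ")")
```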
Matched case-control designs
• To further increase the statistical efficiency of a study, researchers may create a matched case-control design
  § For every case sampled, one or more controls are selected based on similarity to the case
    o Matching each case with $q$ controls is called 1:q matching
  § Goal is to correct for potential confounding in the study design
  § e.g., match each case with a noncase of the same age and gender, resulting in two groups having the same distributions of age and gender
• As with standard case-control, analysis then measures association between an exposure of interest and the outcome
  § Exposure of interest is not a factor used for matching
• Matched designs balance the increased cost of matching each subject against higher power and potential for causal inference
Analyzing matched case-control designs
• Suppose the sample includes $m$ matched sets. How should we approach analysis?
• Naïve approach: choose one matched set to be ‘baseline,’ and include $m - 1$ indicator variables, one for each other set
  § Essentially, treat each matched set as a level of a categorical variable
• Such a model forces us to estimate the effect of exposure within groups that may only have a few people in them
  § Unstable estimation
  § Cannot generalize estimated comparisons of specific pairs of people
• Instead, we want an analysis that estimates the exposure effect by aggregating across matched sets
Conditional logistic regression
• Instead, researchers use conditional logistic regression to estimate the effect of an exposure of interest, conditioning out the factors used to create the matched sets
• To illustrate, assume a matched pairs design. Let
  § $y_{i1} = 1$, $y_{i2} = 0$ be the disease indicators of the $i$th case-control pair
  § $x_{i11}, \ldots, x_{i1k}, x_{i21}, \ldots, x_{i2k}$ be the covariates of the $i$th pair
    o Does not include ‘matched on’ factors, which are accounted for in design
• Then for each pair, define the conditional likelihood contribution
$$L_i(\beta_1, \ldots, \beta_k) = \frac{P(y_{i1} = 1 \cap y_{i2} = 0)}{P(y_{i1} = 1 \cap y_{i2} = 0) + P(y_{i1} = 0 \cap y_{i2} = 1)} = \frac{e^{\sum_{j=1}^{k} \beta_j x_{i1j}}}{e^{\sum_{j=1}^{k} \beta_j x_{i1j}} + e^{\sum_{j=1}^{k} \beta_j x_{i2j}}}$$
• Thus, we can compute estimates that maximize the conditional likelihood
$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \prod_{i=1}^{m} L_i(\beta_1, \ldots, \beta_k)$$
• If $\beta_j$ is the coefficient of the exposure of interest, then as before $\widehat{OR} = e^{\hat{\beta}_j}$
  § Standard methods for testing and CIs are all the same as before
• Note that because we already adjusted for factors used for matching, we do not get estimated effects for these factors
  § It would be inappropriate to include matching factors as covariates
• We also do not get an estimated intercept, which makes sense because the intercept is not interpretable in the case-control setting anyway
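A minimal sketch of conditional likelihood maximization for 1:1 matched pairs with a single binary exposure. The pair counts are hypothetical, and the golden-section search is a simple stand-in optimizer chosen for this example, not what statistical software would typically use. For this special case the maximizer should coincide with the log of the ratio of the two kinds of discordant pairs, which the last lines check.

```python
import math

# Hypothetical 1:1 matched pairs, each written as (case exposure, control exposure).
pairs = [(1, 0)] * 30 + [(0, 1)] * 12 + [(1, 1)] * 20 + [(0, 0)] * 38


def cond_log_lik(beta, pairs):
    """Sum over pairs of log L_i(beta) for a single covariate."""
    total = 0.0
    for x_case, x_control in pairs:
        # L_i = e^{beta * x_case} / (e^{beta * x_case} + e^{beta * x_control})
        total += beta * x_case - math.log(
            math.exp(beta * x_case) + math.exp(beta * x_control)
        )
    return total


def maximize(f, lo=-5.0, hi=5.0, tol=1e-7):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            a = c   # the maximum lies to the right of c
        else:
            b = d   # the maximum lies to the left of d
    return (a + b) / 2.0


beta_hat = maximize(lambda b: cond_log_lik(b, pairs))
print("conditional estimate of beta:", round(beta_hat, 4))
print("log(30 / 12):                ", round(math.log(30 / 12), 4))
print("estimated odds ratio:        ", round(math.exp(beta_hat), 3))
```

Concordant pairs contribute the same constant for every value of the coefficient, so only the 42 discordant pairs carry information about the exposure effect.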
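Logistic regression modeling for prediction
• Using a fitted logistic regression model, we so far have focused on estimation and inference of associations between the covariates and the outcome in the form of odds ratios
• We can also predict individual probabilities of experiencing the outcome using the fitted model
• From our model $\text{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$, we can rearrange to get a predicted probability $\hat{p}$. Writing $\lambda = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$,
$$\ln\frac{p}{1-p} = \lambda \iff \frac{p}{1-p} = e^{\lambda} \iff p = e^{\lambda} - p\,e^{\lambda} \iff p\,(1 + e^{\lambda}) = e^{\lambda} \iff p = \frac{e^{\lambda}}{1 + e^{\lambda}}$$
• Therefore, we see that our regression model leads to predicted probabilities
$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k}}{1 + e^{\hat{\alpha} + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k}}$$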
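The predicted probability formula translates directly into code. The fitted values below are hypothetical and chosen only for illustration; the covariate values are those of patient 1 in the example table.

```python
import math

# Hypothetical fitted values, for illustration only.
alpha_hat = -3.0
beta_hat = {"age": 0.03, "income": 0.00002, "woman": 0.4}


def predicted_probability(covariates):
    """p_hat = e^lam / (1 + e^lam), where lam is the fitted linear predictor."""
    lam = alpha_hat + sum(beta_hat[name] * value for name, value in covariates.items())
    # Algebraically the same as e^lam / (1 + e^lam)
    return 1.0 / (1.0 + math.exp(-lam))


patient = {"age": 60, "income": 32000, "woman": 1}
p_hat = predicted_probability(patient)
print("predicted probability of a checkup:", round(p_hat, 3))
```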
• We may even want to predict individuals’ outcome status, using the model to predict whether or not they will experience the outcome
  § e.g., build a risk prediction model to predict who might develop a disease
• Just as in the linear regression case, prediction introduces important considerations of model selection and prediction validation
Variable selection for prediction
• Variable selection for prediction is also similar to the linear regression setting, and can use similar techniques:
  1) Fixed set by design (treatment indicator + background variables)
  2) Fit all possible subsets of models and find the one that fits the best according to some criterion:
     § AIC or BIC
     § Predictive performance by (cross-validated) AUC
  3) Sequential: forward/backward/stepwise selection
  4) Regularized/penalized regression method
Model selection criteria
• Logistic regression models are fit by maximum likelihood, so if two models have the same number of parameters, choose the one with the higher final likelihood value
  § Similar to linear regression, aim to balance final likelihood ($\hat{L}$) and number of parameters ($k$).
  § General form of any criterion: $f(\hat{L}) + g(k)$
• The same criteria used for linear regression are available here, with $n$ the sample size:
  § Akaike’s Information Criterion: $\text{AIC} = -2\ln\hat{L} + 2k$
  § Bayesian Information Criterion: $\text{BIC} = -2\ln\hat{L} + k\log n$
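Both criteria are simple functions of the maximized log-likelihood, the number of parameters, and (for BIC) the sample size. A short sketch comparing two candidate models; the log-likelihood values and the sample size of 200 are hypothetical.

```python
import math


def aic(log_lik, n_params):
    """Akaike's Information Criterion: -2 ln(L_hat) + 2 * (number of parameters)."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: -2 ln(L_hat) + (number of parameters) * ln(n)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


# Hypothetical maximized log-likelihoods for two candidate models fit to n = 200 subjects.
n_obs = 200
small = {"log_lik": -120.0, "n_params": 3}
large = {"log_lik": -118.5, "n_params": 6}

for name, m in (("small", small), ("large", large)):
    print(name, "AIC =", aic(m["log_lik"], m["n_params"]),
          "BIC =", round(bic(m["log_lik"], m["n_params"], n_obs), 2))
```

Lower values are preferred. In this made-up comparison the larger model has the higher likelihood, but not by enough to offset the penalty for its three extra parameters under either criterion.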
Binary prediction validation
• To measure predictive performance for binary outcomes, one approach returns to our discussion of diagnostic testing
• Recall, in diagnostic testing it is important to balance
  § Correctly testing positive for true disease cases (‘sensitivity’)
  § Correctly testing negative for true non-cases (‘specificity’)

             Test positive                           Test negative                           Total
Disease      True Positive (TP): $D^+ \cap T^+$      False Negative (FN): $D^+ \cap T^-$     $D^+$
No Disease   False Positive (FP): $D^- \cap T^+$     True Negative (TN): $D^- \cap T^-$      $D^-$
Total        $T^+$                                   $T^-$
• Binary prediction is nearly identical, where instead of ‘testing’ we are ‘predicting’ disease status
  § Want to correctly predict disease in true cases, and correctly predict no disease in true non-cases
• How do we convert predicted individual probabilities $\hat{p}$ into discrete ‘case’ or ‘non-case’ predictions?
  § We must choose an arbitrary cutoff value, e.g., “if $\hat{p}_i > 0.5$ then the $i$th individual is predicted to be a case”
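A small sketch of turning predicted probabilities into case / non-case predictions at a cutoff and scoring them against known status. The ten subjects and their predicted probabilities are made up for illustration.

```python
def classify(p_hats, cutoff=0.5):
    """Predict 'case' (1) when the predicted probability exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in p_hats]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) from true and predicted 0/1 labels."""
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    tn = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 0)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical true status and predicted probabilities for ten subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_hats = [0.9, 0.7, 0.55, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

for cutoff in (0.25, 0.5):
    sens, spec = sensitivity_specificity(y_true, classify(p_hats, cutoff))
    print(f"cutoff {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Lowering the cutoff labels more subjects as cases, which raises sensitivity at the expense of specificity.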
• How to choose the cutoff value for the Vit. E level in serum?
• For a group with KNOWN disease status, let’s list some possible cutoff values. First, we’ll see how many disease vs. no disease patients fall on either side of each cutoff…

Proportion of patients with value ≥ cutoff
Predicted probability cutoff for predicted ‘case’   0.01   0.20   0.40   0.60   0.80   0.99
Disease                                             0.95   0.87   0.73   0.54   0.34   0.17
No Disease                                          0.91   0.68   0.38   0.12   0.02   0.002
• Several important relationships here:
  § $P(\hat{p}_i \geq \text{cutoff} \mid \text{patient } i \text{ is a case}) = \text{sensitivity}$
  § $P(\hat{p}_i \geq \text{cutoff} \mid \text{patient } i \text{ is a noncase}) = 1 - \text{specificity}$
• We can summarize the testing possibilities by plotting sensitivity vs. (1 - specificity)…
Predicted probability cutoff for ‘case’   0.01   0.20   0.40   0.60   0.80   0.99
Sensitivity                               0.95   0.87   0.73   0.54   0.34   0.17
1 - Specificity                           0.91   0.68   0.38   0.12   0.02   0.002
• A receiver operating characteristic (ROC) curve for a test is a plot of sensitivity vs. (1 - specificity)
  § A test’s ROC curve helps us choose an optimal cutoff point. It also shows us how useful a test is overall.
• The Area Under the Curve (AUC) is a single number summarizing a test’s ability to discriminate between true cases and true noncases.
  § AUC is the probability, for a randomly chosen case and noncase, that the case will have the higher predicted probability; 0.5 is a ‘coin toss’
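As a rough sketch, the area under the ROC curve can be approximated by the trapezoid rule from the six tabulated (1 - specificity, sensitivity) points. Adding the corner points (0, 0) and (1, 1) is an assumption made here to close the curve, and with only six cutoffs the result is a coarse approximation.

```python
# Values from the lecture table, one entry per cutoff.
cutoffs = [0.01, 0.20, 0.40, 0.60, 0.80, 0.99]
sensitivity = [0.95, 0.87, 0.73, 0.54, 0.34, 0.17]
one_minus_specificity = [0.91, 0.68, 0.38, 0.12, 0.02, 0.002]


def roc_auc_trapezoid(fpr, tpr):
    """Trapezoid-rule area under the points (fpr, tpr), closed at (0, 0) and (1, 1)."""
    points = sorted(set(zip(fpr, tpr)) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # width times average height
    return area


auc = roc_auc_trapezoid(one_minus_specificity, sensitivity)
print("approximate AUC:", round(auc, 3))
```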
Cross-validation
• Cross-validation extends to the binary classification setting using prediction metrics like AUC
• The available dataset is divided into two (or 3) random parts.
  § Training set is used to fit the model.
  § Test set is used to check the predictive capability (e.g., AUC) and refine the model. Go back to training if needed.
  § Optional: Validation set used once to estimate model’s true AUC.
• If the dataset is smaller or you do not want to set aside data, you can still estimate AUC using $k$-fold cross validation
• In $k$-fold cross validation, the data is split into $k$ groups, and the model is repeatedly fit on all but one group, then its ability to predict the left-out group is recorded
  § average AUC over all $k$ groups estimates predictive performance on ‘new’ dataset
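The procedure can be sketched with generic `fit` and `predict` functions. Everything below is an illustrative assumption: the data are simulated, and the `fit` function is a deliberately trivial stand-in (it only learns the direction in which the covariate separates cases from noncases) where a real analysis would fit a logistic regression.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal groups."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(y_true, scores):
    """Probability that a random case outscores a random noncase (ties count 1/2)."""
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    noncases = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in noncases)
    return wins / (len(cases) * len(noncases))


def cross_validated_auc(x, y, k, fit, predict, seed=0):
    """Fit on all but one group, score the left-out group, average the k AUCs."""
    fold_aucs = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        scores = [predict(model, x[i]) for i in held_out]
        fold_aucs.append(auc([y[i] for i in held_out], scores))
    return sum(fold_aucs) / k


# Simulated data: one covariate that is informative about the outcome.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]
y = [1 if xi + 0.5 * rng.gauss(0.0, 1.0) > 0 else 0 for xi in x]


def fit(x_train, y_train):
    """Stand-in 'model': +1 if cases have the larger mean covariate, else -1."""
    n_cases = sum(y_train)
    mean_case = sum(xi for xi, yi in zip(x_train, y_train) if yi == 1) / n_cases
    mean_non = sum(xi for xi, yi in zip(x_train, y_train) if yi == 0) / (len(y_train) - n_cases)
    return 1.0 if mean_case >= mean_non else -1.0


def predict(model, xi):
    """Score for one subject; higher means more likely to be a case."""
    return model * xi


cv_auc = cross_validated_auc(x, y, k=5, fit=fit, predict=predict)
print("5-fold cross-validated AUC:", round(cv_auc, 3))
```

Each subject is scored exactly once, by a model that never saw that subject during fitting, which is why the averaged AUC speaks to performance on new data.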