Day 2, Morning: The Logic of Distributions
Instructor: Marshall A. Taylor, 844 Flanner Hall, mtaylo15@nd.edu, marshalltaylor.net
Recap
• Yesterday morning we talked about:
– The descriptive and inferential purposes of statistics.
– The difference between samples and populations, the issues that arise because of sampling error and sample bias, and the ways in which probability theory can be used to address the former.
– The basics of probability theory.
Recap
• Yesterday evening we talked about:
– The difference between univariate, bivariate, and multivariate statistics.
– What a variable is and the general forms that it can take: nominal, ordinal, or continuous.
– Measures of central tendency for each type of variable.
– Measures of dispersion.
Game Plan for Today
• Morning
– We will bring these previous lectures together to show how we can use probability theory to assess how representative our sample distribution is of the population distribution.
• Evening
– After outlining the asymptotic theory of probability distributions in the morning, we will then examine basic univariate tests of statistical inference to quantify how well our sample statistics approximate population parameters net of sampling error.
What is a distribution?
In statistics, a distribution is simply the array of values for one or more variables across a set of units (people, groups, etc.).
The distribution is everything
• The concept of “distribution” has been at the implicit center of absolutely everything we have talked about so far!
• We have specifically looked at sample statistic distributions, such as…
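Frequency Distributions

. tab health

   1=poor,..., |
   5=excellent |      Freq.     Percent        Cum.
---------------+-----------------------------------
          poor |        729        7.05        7.05
          fair |      1,670       16.16       23.21
       average |      2,938       28.43       51.64
          good |      2,591       25.07       76.71
     excellent |      2,407       23.29      100.00
---------------+-----------------------------------
         Total |     10,335      100.00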
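The Percent and Cum. columns follow directly from the frequencies. The sketch below recomputes them in Python, which is used here only for illustration (the slides themselves use Stata); the counts are copied from the table and the variable names are hypothetical.

```python
# Frequencies copied from the `tab health` output.
freqs = {"poor": 729, "fair": 1670, "average": 2938, "good": 2591, "excellent": 2407}

total = sum(freqs.values())

# Percent = category count / total; Cum. = running share of all cases so far.
rows = []
running = 0
for label, count in freqs.items():
    running += count
    percent = round(100 * count / total, 2)
    cumulative = round(100 * running / total, 2)
    rows.append((label, count, percent, cumulative))

for label, count, percent, cumulative in rows:
    print(f"{label:>10} {count:>6} {percent:>7.2f} {cumulative:>7.2f}")
```

The cumulative column is built from running counts rather than from the rounded percentages, which avoids accumulating rounding error.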
Frequency distributions as “histograms”
[Figure: histogram of systolic blood pressure. x-axis: systolic blood pressure (50 to 300); y-axis: frequency (0 to 1,500).]
Bimodal distribution with categorical variables
[Figure: bar chart by sex (Male, Female), vertical axis from 0 to .8. Legend: Solid R or D, Likely R or D, Leaning R or D, Toss-Up.]
Bimodal distribution with continuous variables
[Figure: histogram of age in years. x-axis: age in years (20 to 80); y-axis: frequency (0 to 600).]
Note the two peaks.
Distributions and Probability Theory
• Distributions serve much more than a descriptive purpose.
• We rely on the asymptotic theory of probability distributions to make statistical inferences.
• Such a theory gives us dependable ideas about what the sampling distribution will look like.
What is “asymptotic”?
• “Asymptotic” refers to the property that, if sampled an infinite number of times, a statistic will converge to the population parameter it is meant to approximate.
• That is, $\bar{x} \to \mu$ as $n \to \infty$, assuming that all n are random samples of N.
What is a “probability distribution”?
• A “probability distribution” is an array of probabilistic values for a variable across a set of units, where the values are proportions that must sum to 1.
       v1            v2            v3            v4            v5            RMargins
r1     0.0000576938  0.1088889841  0.1090984138  0.0000576950  0.7818972133  1
r2     0.0000476265  0.0000476352  0.9123043912  0.0119956500  0.0756046971  1
r3     0.0000881123  0.0179937158  0.0264640660  0.0000881160  0.9553659899  1
r4     0.1108000716  0.2503997467  0.5927814401  0.0459855150  0.0000332266  1
r5     0.2411895005  0.0404012131  0.6524702741  0.0001041705  0.0658348418  1
r6     0.1470047772  0.0002170118  0.6761725881  0.0002169660  0.1763886570  1
r7     0.8856909695  0.0001638012  0.0001637739  0.0001637537  0.1138177018  1
r8     0.2807926994  0.2093476438  0.0530959614  0.0000664792  0.4566972163  1
r9     0.5049335568  0.0260158232  0.0000863853  0.4424611335  0.0265031011  1
r10    0.0203117985  0.3800703244  0.2362073342  0.0352103566  0.3282001863  1
r11    0.1901308947  0.5532532962  0.0002811341  0.0002811777  0.2560534974  1
r12    0.3384214744  0.4279357679  0.1207937032  0.1128139993  0.0000350552  1
r13    0.8520671572  0.0001154451  0.0462275639  0.1014744017  0.0001154321  1
r14    0.8163828385  0.1158173532  0.0000841640  0.0213096942  0.0464059501  1
r15    0.4869992267  0.2536825278  0.0000751098  0.0210530230  0.2381901127  1
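The defining property here is that each row's values are proportions that sum to 1 (the RMargins column). A small Python sketch of that check follows, using the first two rows of the table; the function names are hypothetical and chosen only for this illustration.

```python
import math

# First two rows of the table (columns v1..v5).
r1 = [0.0000576938, 0.1088889841, 0.1090984138, 0.0000576950, 0.7818972133]
r2 = [0.0000476265, 0.0000476352, 0.9123043912, 0.0119956500, 0.0756046971]


def is_probability_distribution(values, tol=1e-9):
    """True if every value is a proportion in [0, 1] and the values sum to 1."""
    in_range = all(0.0 <= v <= 1.0 for v in values)
    return in_range and math.isclose(sum(values), 1.0, abs_tol=tol)


def to_proportions(counts):
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


print(is_probability_distribution(r1), is_probability_distribution(r2))
```

The same check applies to any frequency table once its counts are converted with `to_proportions`.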
Putting it together
• If we have an infinite number of random sample estimates from the same population, the mean of this distribution of estimates will converge to the population mean.
• Given that we can never really have an infinite number of samples, the asymptotic theory of probability distributions suggests that, with larger sample sizes, we can infer with greater degrees of probabilistic confidence whether or not our single sample statistic accurately reflects the unknown population parameter.
• As n approaches N, the sampling error gets smaller and smaller, meaning that the reliability of our estimate gets better and better. This is because, theoretically, if the sample size (n) keeps growing, it will eventually just be the population (N)!
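The first point above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is a hypothetical set of 20,000 skewed values generated for the example (not the NHANES data), and the sample size of 50 is arbitrary.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of N = 20,000 values (deliberately skewed).
population = [rng.expovariate(1.0) for _ in range(20_000)]
mu = statistics.fmean(population)


def mean_of_sample_means(n, n_samples):
    """Draw n_samples random samples of size n and average their means."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]
    return statistics.fmean(means)


# The gap between the average estimate and mu tends to shrink
# as the number of repeated samples grows.
for n_samples in (10, 100, 2_000):
    estimate = mean_of_sample_means(n=50, n_samples=n_samples)
    print(n_samples, round(estimate - mu, 4))
```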
Putting it together
• Of course, such a theory requires that we make assumptions about the shape of the unknown population distribution.
• Otherwise we don’t know what n is approximating!
Central Limit Theorem
• Lucky for us, some very intelligent statisticians who came before us noticed that, as sample sizes grow larger and larger, the distribution of sample means becomes approximately normal, regardless of whether or not the variable itself is normally distributed in the population. This is the Central Limit Theorem (CLT).
• By normal distribution, we mean a symmetric distribution where approximately half of the data fall to either side of the mean. It is commonly known as a distribution that follows a bell curve.
Central Limit Theorem
According to the CLT, we can expect that, with a normal distribution of random sample means, approximately 68% of the sample means will be within one standard deviation on either side of the population mean (μ). About 95% will be within two, and 99.7% within three.
*Figure from the Math Is Fun website (https://www.mathsisfun.com/data/standard-normal-distribution.html).
Central Limit Theorem
Notice how the data become more symmetric about the mean as the sample size increases. As such, large sample sizes can serve as “proxies” for repeated random samples and justify the CLT.
[Figure: four density plots, for n = 50, n = 500, n = 5,000, and n = 50,000. y-axis: density.]
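One way to see the CLT at work is to simulate sample means from a clearly non-normal population and watch their skewness fall toward 0 as the sample size grows. The sketch below is an illustration under assumptions, not a reproduction of the figure: the exponential population, the sample sizes, and the number of replications are all hypothetical choices.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable


def skewness(values):
    """Sample skewness: 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - m) / sd) ** 3 for v in values)


def sample_means(n, reps=2_000):
    """Means of `reps` random samples of size n from a skewed (exponential) population."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]


# Skewness of the simulated sampling distribution at each sample size.
skew_by_n = {n: skewness(sample_means(n)) for n in (2, 30, 300)}
for n, skew in skew_by_n.items():
    print(n, round(skew, 3))
```

With very small samples the distribution of means is still visibly right-skewed; with a few hundred cases per sample it is close to symmetric.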
Standard Error
• So, what can we say about population parameters given a sample statistic and these population distribution assumptions?
• For starters, we can calculate the standard deviation of the theoretical distribution of random sample means around the unknown population mean, also known as the standard deviation of the sampling distribution. This is known as the standard error, and can be found with:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
– Where $\sigma$ is the population standard deviation and the denominator is the square root of the sample size.
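Standard Error
• Of course, $\sigma$ is usually not known, so we use the sample standard deviation as an approximation:
$s_{\bar{x}} = \frac{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)}}{\sqrt{n}}$
• Or more simply:
$s_{\bar{x}} = \frac{s}{\sqrt{n}}$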
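Standard Error
• What does the standard error of the systolic blood pressure variable tell us? How is this different from the standard deviation?

. sum bpsystol

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    bpsystol |      10337    130.8826    23.34159         65        300

. ci bpsystol

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
    bpsystol |      10337    130.8826    .2295796        130.4325    131.3326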
Standard Error
• A random sample mean drawn from the population (such as this one) likely differs from the population mean systolic blood pressure by about 0.23 mmHg.
Standard Error
• The standard deviation, however, is merely a descriptive indication of variable dispersion. The average respondent in the sample diverges from the mean by about 23.34 mmHg.
Standard Error
• Just to check the math:
$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{23.34159}{\sqrt{10337}} \approx 0.2296$
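The same check can be scripted. This is a minimal Python sketch (the slides use Stata; the function name is hypothetical), with the standard deviation and sample size copied from the `sum bpsystol` output.

```python
import math


def standard_error(s, n):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)


# Values copied from the `sum bpsystol` output.
se_bpsystol = standard_error(s=23.34159, n=10337)
print(round(se_bpsystol, 4))
```

The result should agree with the Std. Err. column reported by `ci bpsystol` up to rounding.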
Standard Error
• Notice that $s_{\bar{x}}$ gets smaller when two things happen:
– (1) When s, the standard deviation, is small.
– (2) When the sample size is large.
• But also note that s itself is a more stable estimate of σ when the sample size is larger.
• When n is large, and therefore a closer approximation of N, the sampling distribution varies less and it is more likely that the sample mean represents the population mean!
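Because n enters through its square root, the standard error shrinks more slowly than the sample grows. A brief sketch, holding s fixed at the bpsystol value purely for illustration and using hypothetical sample sizes:

```python
import math

s = 23.34159  # sample standard deviation of bpsystol, held fixed for illustration


def se(n):
    """Standard error of the mean for a sample of size n, with s held fixed."""
    return s / math.sqrt(n)


# Hypothetical sample sizes; each is four times the previous one,
# so each standard error is half the previous one.
for n in (100, 400, 1_600, 6_400):
    print(n, round(se(n), 4))
```

Quadrupling the sample size halves the standard error, so precision gains get more expensive as n grows.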
Confidence Interval for Mean
• We can also use the standard error and our knowledge of the normal distribution to construct a confidence interval around the mean: that is, the band of values within which the population mean, µ, is likely to reside.
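Confidence Interval for Mean
• The confidence interval can be found with: $\bar{x} \pm z \cdot s_{\bar{x}}$ or $\bar{x} \pm z \left( \frac{s}{\sqrt{n}} \right)$
• Where z is our critical value: i.e., the number of standard deviations away from the mean that represent the range of probabilities within which we think the population mean resides.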
Wait… z-value? What’s that?
• Think back to what the CLT tells us:
– About 68% of sample means fall within about one standard deviation on either side of the population mean.
– About 95% fall within about two.
– About 99.7% fall within about three.
• We can use this information to find the standard deviations that correspond to the distribution percentiles that capture these percentages.
Wait… z-value? What’s that?
• For example, though we say that 95% of the estimates fall within about two standard deviations of the population mean, the more precise number is 1.96. It is our z-value!
That is, about 95% of the sample means fall within ±1.96 standard deviations of the population mean. (Never mind that 0; for our purposes, think of it as µ.)
*Photo from Wikipedia (https://en.wikipedia.org/wiki/1.96).
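The 68/95/99.7 figures and the critical values 1.96 and 2.58 can both be recovered from the standard normal distribution. A minimal sketch using Python's standard library (`statistics.NormalDist`); the helper names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def coverage(z):
    """Share of a normal distribution within ±z standard deviations of its mean."""
    return std_normal.cdf(z) - std_normal.cdf(-z)


def critical_z(confidence):
    """Two-sided critical value, e.g. confidence=0.95 leaves 2.5% in each tail."""
    return std_normal.inv_cdf(1 - (1 - confidence) / 2)


for z in (1, 2, 3):
    print(z, round(coverage(z), 4))
for level in (0.95, 0.99):
    print(level, round(critical_z(level), 2))
```

Going from coverage to a critical value is just the inverse lookup, which is why "about two" standard deviations sharpens to 1.96 for exactly 95%.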
Confidence Interval Example
• The mean weight (in kilograms) in our NHANES sample is 71.90. The standard deviation is 15.36, and our sample size is 10,337. What range of kilograms can we be 95% confident includes the population mean?

. sum weight

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |      10337    71.90088    15.35515      30.84     175.88
Confidence Interval Example
• Let’s start by first computing the standard error:
$s_{\bar{x}} = \frac{15.35515}{\sqrt{10337}} \approx 0.151$
Confidence Interval Example
• We want to capture the population mean within the band of values that, according to the CLT, likely fall within ±1.96 standard deviations from the population mean. As such:
$71.90088 \pm 1.96(0.151) = [71.605, 72.197]$
Confidence Interval Example
• There is a 95% chance that the interval between 71.605 kg and 72.197 kg contains the mean population weight.
Confidence Interval Example
• Confirm with Stata:

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      weight |      10337    71.90088    .1510277        71.60484    72.19692
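The hand calculation can be reproduced in a few lines. This is a sketch of the normal-approximation interval described in the slides, written in Python for illustration; the function name is hypothetical and the inputs are copied from the `sum weight` output.

```python
import math


def confidence_interval(mean, s, n, z=1.96):
    """Normal-approximation interval: mean ± z * s / sqrt(n)."""
    margin = z * s / math.sqrt(n)
    return mean - margin, mean + margin


# Values copied from the `sum weight` output.
lower, upper = confidence_interval(mean=71.90088, s=15.35515, n=10337)
print(round(lower, 3), round(upper, 3))
```

The bounds agree with Stata's to three decimals. The tiny difference further out is likely because Stata takes its critical value from the t distribution rather than using exactly 1.96; that explanation is an assumption, not something stated in the slides.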
CI example with different critical value
• What if we wanted to be, say, 99% confident that our interval contains µ?
• The critical z-value for a 99% confidence interval is 2.58. This means that, following the CLT, we expect about 99% of sample means pulled randomly from our sampling distribution to fall within ±2.58 standard deviations of the population mean.
CI example with different critical value
• Let’s do the math:
$71.90088 \pm 2.58(0.151) = [71.51, 72.29]$
CI example with different critical value
• We can say that, 99 times out of 100, we have captured the mean population weight with the interval between 71.51 kg and 72.29 kg.
CI example with different critical value
• Confirm with Stata:

    Variable |        Obs        Mean    Std. Err.       [99% Conf. Interval]
-------------+---------------------------------------------------------------
      weight |      10337    71.90088    .1510277        71.51179    72.28997
Confidence Interval Precision
• Note that the confidence interval gets wider when we go from 95% to 99% confidence.
– For 95% CI: 72.197 - 71.605 = .592
– For 99% CI: 72.290 - 71.512 = .778
• This is because we have to accept less precision when we try to be more confident!
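The trade-off between confidence and precision can be seen by computing the interval width at several confidence levels. The sketch below assumes the normal approximation used throughout these slides; the 90% level is added only for comparison and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def ci_width(s, n, confidence):
    """Width of the normal-approximation interval at a given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * s / math.sqrt(n)


# Standard deviation and sample size copied from the `sum weight` output.
widths = {
    level: ci_width(s=15.35515, n=10337, confidence=level)
    for level in (0.90, 0.95, 0.99)
}
for level, width in widths.items():
    print(level, round(width, 3))
```

With s and n held fixed, the only thing that changes is the critical value, so a higher confidence level always produces a wider interval.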
z-scores
• Recall that critical z-values (e.g., ±1.96 and ±2.58) are the standard deviations away from µ that we would expect to capture 95% and 99% of random sample means (respectively) in a normal sampling distribution.
• Theoretically, the z-value for any given case can be calculated with $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$. This value would tell us how many standard deviations the case is from µ.
• We can apply this same logic to an individual variable to quantify how far a specific case is from the variable mean. This measure is called a z-score, or a standardized score.
z-scores
• The z-score for an individual case can be found by subtracting the variable mean from the raw score and then dividing the difference by the variable standard deviation:
$z_i = \frac{x_i - \bar{x}}{s}$
– Where, as before, $\bar{x}$ is the mean for the variable and s is the variable standard deviation.
z-score example
• Below is the distribution of systolic blood pressure readings for the NHANES sample. The mean is 130.88 mmHg. The green bar is a particular value of the variable: 110 mmHg. The standard deviation is 23.34 mmHg. What is the z-score for this case, and what does this number mean?
[Figure: histogram of systolic blood pressure with the bar at 110 highlighted in green. x-axis: systolic blood pressure (50 to 290); y-axis: frequency (0 to 1,000).]
z-score example
• A case with a systolic blood pressure reading of 110 mmHg is a little less than 1 standard deviation below the mean.
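The arithmetic behind that statement is a single subtraction and division. A minimal Python sketch, with the mean and standard deviation copied from the `sum bpsystol` output and a hypothetical function name:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Mean and standard deviation copied from the `sum bpsystol` output.
z_110 = z_score(x=110, mean=130.8826, sd=23.34159)
print(round(z_110, 2))
```

The negative sign says the reading is below the mean, and a magnitude under 1 says it is within one standard deviation of it.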
z-score example
• Confirming with Stata. Note that the same bar is colored green. That’s because they are the same cases!
[Figure: two histograms with the same bar highlighted in green. First x-axis: systolic blood pressure (50 to 290); second x-axis: zsystol (-2 to 8); y-axis: frequency (0 to 1,000).]
Conclusion
• We have seen how the asymptotic theory of probability distributions allows us to assess how well our single sample mean represents the true population mean in the absence of repeated random samples.
• It does this by following the CLT. This allows us to use our sample size to estimate how well hypothetical random samples (of the same size) would approximate a normal distribution and therefore approximate the population mean.
Conclusion
• Though standard errors and confidence intervals help us get an idea of where the population mean may be, how do we know these numbers are reliable? That is, how do we know that our estimates aren’t just the product of sampling error?
• This is the job of statistical inference, and it is the topic for the next session!
Datasets Used
• The Stata survey documentation data, nhanes2f, from the Stata Press website. Retrieved July 24, 2016 (http://www.stata-press.com/data/r11/svy.html).