Day 2, Morning: The Logic of Distributions

Instructor: Marshall A. Taylor
844 Flanner Hall
mtaylo15@nd.edu
marshalltaylor.net

Recap

•  Yesterday morning we talked about:
– The descriptive and inferential purposes of statistics.
– The difference between samples and populations, the issues that arise because of sampling error and sample bias, and the ways in which probability theory can be used to address the former.
– The basics of probability theory.

Recap

•  Yesterday evening we talked about:
– The difference between univariate, bivariate, and multivariate statistics.
– What a variable is and the general forms that it can take: nominal, ordinal, or continuous.
– Measures of central tendency for each type of variable.
– Measures of dispersion.

Game Plan for Today

•  Morning
– We will bring these previous lectures together to show how we can use probability theory to assess how representative our sample distribution is of the population distribution.

•  Evening
– After outlining the asymptotic theory of probability distributions in the morning, we will then examine basic univariate tests of statistical inference to quantify how well our sample statistics approximate population parameters net of sampling error.

What is a distribution?

In statistics, a distribution is simply the array of values for one or more variables across a set of units (people, groups, etc.).

The distribution is everything

•  The concept of “distribution” has been at the implicit center of absolutely everything we have talked about so far!

•  We have specifically looked at sample statistic distributions, such as…

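Frequency Distributions

. tab health

1=poor,..., |
5=excellent |      Freq.     Percent        Cum.
------------+-----------------------------------
       poor |        729        7.05        7.05
       fair |      1,670       16.16       23.21
    average |      2,938       28.43       51.64
       good |      2,591       25.07       76.71
  excellent |      2,407       23.29      100.00
------------+-----------------------------------
      Total |     10,335      100.00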
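The Percent and Cum. columns of the frequency table can be rebuilt from the raw frequencies alone. The following is a minimal Python sketch, not part of the original Stata session; the variable names are illustrative.

```python
# Frequencies copied from the `tab health` output (1 = poor, ..., 5 = excellent).
freqs = {"poor": 729, "fair": 1670, "average": 2938, "good": 2591, "excellent": 2407}

total = sum(freqs.values())

rows = []
cumulative = 0.0
for category, count in freqs.items():
    percent = 100 * count / total      # share of all respondents in this category
    cumulative += percent              # running total of the shares so far
    rows.append((category, count, round(percent, 2), round(cumulative, 2)))

for category, count, percent, cum in rows:
    print(f"{category:>10} {count:>7,} {percent:>8.2f} {cum:>8.2f}")
print(f"{'Total':>10} {total:>7,} {100:>8.2f}")
```

Each percentage is a category's frequency divided by the total, and the cumulative column is the running sum of those percentages.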

Frequency distributions as “histograms”

[Figure: histogram of systolic blood pressure. x-axis: systolic blood pressure (50 to 300); y-axis: frequency (0 to 1,500).]

Bimodal distribution with categorical variables

[Figure: chart comparing Male and Female across four categories (Solid R or D, Likely R or D, Leaning R or D, Toss-Up). y-axis: 0 to .8.]

Bimodal distribution with continuous variables

[Figure: histogram of age in years. x-axis: age in years (20 to 80); y-axis: frequency (0 to 600).]

Note the two peaks.

Distributions and Probability Theory

•  Distributions serve much more than a descriptive purpose.

•  We rely on the asymptotic theory of probability distributions to make statistical inferences.

•  Such a theory gives us dependable ideas about what the sampling distribution will look like.

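What is “asymptotic”?

•  “Asymptotic” refers to the property that, if sampled an infinite number of times, a statistic will converge to the population parameter it is meant to approximate.

•  That is, the sample statistic converges to the population parameter (for example, $\bar{x} \to \mu$) as the number of samples grows without bound, assuming that all n are random samples of N.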

What is a “probability distribution”?

•  A “probability distribution” is an array of probabilistic values for a variable across a set of units, where the values are proportions that must sum to 1.

•  In the table below, each row is one such distribution: its five values sum to the row margin (RMargins) of 1.

        V1            V2            V3            V4            V5            RMargins
 1      0.0000576938  0.1088889841  0.1090984138  0.0000576950  0.7818972133  1
 2      0.0000476265  0.0000476352  0.9123043912  0.0119956500  0.0756046971  1
 3      0.0000881123  0.0179937158  0.0264640660  0.0000881160  0.9553659899  1
 4      0.1108000716  0.2503997467  0.5927814401  0.0459855150  0.0000332266  1
 5      0.2411895005  0.0404012131  0.6524702741  0.0001041705  0.0658348418  1
 6      0.1470047772  0.0002170118  0.6761725881  0.0002169660  0.1763886570  1
 7      0.8856909695  0.0001638012  0.0001637739  0.0001637537  0.1138177018  1
 8      0.2807926994  0.2093476438  0.0530959614  0.0000664792  0.4566972163  1
 9      0.5049335568  0.0260158232  0.0000863853  0.4424611335  0.0265031011  1
10      0.0203117985  0.3800703244  0.2362073342  0.0352103566  0.3282001863  1
11      0.1901308947  0.5532532962  0.0002811341  0.0002811777  0.2560534974  1
12      0.3384214744  0.4279357679  0.1207937032  0.1128139993  0.0000350552  1
13      0.8520671572  0.0001154451  0.0462275639  0.1014744017  0.0001154321  1
14      0.8163828385  0.1158173532  0.0000841640  0.0213096942  0.0464059501  1
15      0.4869992267  0.2536825278  0.0000751098  0.0210530230  0.2381901127  1
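The “must sum to 1” requirement can be checked directly, and raw counts can be turned into a probability distribution by dividing each by the total. A minimal Python sketch follows. The `counts` list is hypothetical (it is not from the slides); `row_1` is the first row of the table.

```python
def to_probability_distribution(counts):
    """Convert non-negative counts into proportions that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return [c / total for c in counts]

# Hypothetical counts for one unit across five outcomes (made up for illustration).
counts = [2, 10, 60, 5, 23]
probs = to_probability_distribution(counts)
print(probs, sum(probs))

# First row of the table: five proportions that should add up to 1.
row_1 = [0.0000576938, 0.1088889841, 0.1090984138, 0.0000576950, 0.7818972133]
print(round(sum(row_1), 10))
```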


Putting it together

•  If we have an infinite number of random sample estimates from the same population, the mean of this distribution of estimates will converge to the population mean.

•  Given that we can never really have an infinite number of samples, the asymptotic theory of probability distributions suggests that, with larger sample sizes, we can infer with greater degrees of probabilistic confidence whether or not our single sample statistic accurately reflects the unknown population parameter.

•  As n approaches N, the sampling error gets smaller and smaller, meaning that the reliability of our estimate gets better and better. This is because, theoretically, if the sample size (n) keeps growing, it will eventually just be the population (N)!
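Both ideas can be tried out in a small simulation: the mean of many sample means settles near the population mean, and a sample of size N is simply the population. This is a sketch only. The population here is made up (100,000 simulated readings), not the NHANES data.

```python
import random
import statistics

random.seed(2016)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 readings (simulated, not real data).
population = [random.gauss(130, 23) for _ in range(100_000)]
mu = statistics.fmean(population)

def mean_of_sample_means(n, draws):
    """Average the means of `draws` random samples, each of size n."""
    means = [statistics.fmean(random.sample(population, n)) for _ in range(draws)]
    return statistics.fmean(means)

# More repeated samples: the mean of the estimates drifts toward mu.
for draws in (10, 100, 1000):
    estimate = mean_of_sample_means(n=50, draws=draws)
    print(draws, round(abs(estimate - mu), 3))

# When n equals N, the "sample" is the whole population, so no sampling error remains.
full_sample_mean = statistics.fmean(random.sample(population, len(population)))
print(abs(full_sample_mean - mu) < 1e-9)
```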

Putting it together

•  Of course, such a theory requires that we make assumptions about the shape of the unknown population distribution.

•  Otherwise we don’t know what n is approximating!

Central Limit Theorem

•  Lucky for us, some very intelligent statisticians who came before us noticed that, as sample sizes grow larger and larger, the distribution of sample means becomes approximately normal, regardless of whether or not the variable itself is normally distributed in the population. This is the Central Limit Theorem (CLT).

•  By normal distribution, we mean a symmetric distribution where approximately half of the data fall to either side of the mean. It is commonly known as a distribution that follows a bell curve.

Central Limit Theorem

According to the CLT, we can expect that, with a normal distribution of random sample means, approximately 68% of the sample means will be within one standard deviation on either side of the population mean (μ), 95% will be within two, and 99.7% within three.

*Figure from Math Is Fun website (https://www.mathsisfun.com/data/standard-normal-distribution.html).
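The 68%, 95%, and 99.7% figures follow from the normal curve itself. A short Python sketch using the standard library's `statistics.NormalDist` recovers them; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

The shares come out at roughly 68.3%, 95.4%, and 99.7%, which is why "two standard deviations" is only an approximation for 95%.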

Central Limit Theorem

Notice how the data become more symmetric about the mean as the sample size increases. As such, large sample sizes can serve as “proxies” for repeated random samples and justify relying on the CLT.

[Figure: four density plots, with panels labeled n50, n500, n5000, and n50000. y-axis: density.]
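The CLT claim about sample means can also be simulated. The sketch below does not reproduce the figure. It assumes a deliberately skewed population (exponential with mean 1), draws many samples of each size, and measures how skewed the resulting sample means are. A skewness near 0 indicates a symmetric distribution.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

def sample_means(n, draws=2000):
    """Means of `draws` random samples of size n from a skewed (exponential) population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(draws)
    ]

def skewness(values):
    """Average cubed z-score: 0 for a symmetric distribution, positive for right skew."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

skew_by_n = {n: skewness(sample_means(n)) for n in (2, 30, 300)}
for n, skew in skew_by_n.items():
    print(n, round(skew, 2))
```

Even though the underlying population is strongly right-skewed, the skewness of the sample means shrinks toward 0 as n grows.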

Standard Error

•  So, what can we say about population parameters given a sample statistic and these population distribution assumptions?

•  For starters, we can calculate the standard deviation of the theoretical distribution of random sample means around the unknown population mean, also known as the standard deviation of the sampling distribution. This is known as the standard error, and can be found with:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

– Where $\sigma$ is the population standard deviation and the denominator is the square root of the sample size.

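Standard Error

•  Of course, $\sigma$ is usually not known, so we use the sample standard deviation as an approximation:

$$s_{\bar{x}} = \frac{\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}}{\sqrt{n}}$$

•  Or more simply:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$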

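Standard Error

•  What does the standard error of the systolic blood pressure variable tell us? How is this different from the standard deviation?

. sum bpsystol

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    bpsystol |      10337    130.8826    23.34159         65        300

. ci bpsystol

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
    bpsystol |      10337    130.8826    .2295796        130.4325    131.3326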

Standard Error

•  A random sample mean drawn from the population (such as this one) typically differs from the population mean systolic blood pressure by about 0.23 mmHg.

Standard Error

•  The standard deviation, however, is merely a descriptive indication of variable dispersion. A typical respondent in the sample diverges from the sample mean by about 23.34 mmHg.

Standard Error

•  Just to check the math:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{23.34159}{\sqrt{10337}} \approx 0.2296$$
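The same check can be done in a few lines of Python, using only the standard deviation and sample size reported by `sum bpsystol`. The function name `standard_error` is illustrative.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return sd / math.sqrt(n)

# Values taken from the `sum bpsystol` output.
se_bpsystol = standard_error(sd=23.34159, n=10337)
print(round(se_bpsystol, 4))
```

The result agrees with the Std. Err. column that `ci bpsystol` reports (.2295796), up to rounding of the inputs.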

Standard Error

•  Notice that the standard error, $s_{\bar{x}}$, gets smaller when two things happen:
– (1) When s, the standard deviation, is small.
– (2) When the sample size is large.

•  But also note that s itself becomes a more stable estimate of the population standard deviation when the sample size is larger.

•  When n is large, and therefore a closer approximation of N, the sampling distribution varies less and it is more likely that the sample mean represents the population mean!
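To see how strongly the sample size drives the standard error, the sketch below holds s fixed at the systolic blood pressure value (23.34159) and varies n. The smaller sample sizes are hypothetical and chosen only for illustration.

```python
import math

s = 23.34159  # sample standard deviation of bpsystol, held fixed

def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)

# Hypothetical sample sizes, plus the actual NHANES n of 10,337.
for n in (100, 400, 1600, 10337):
    print(n, round(standard_error(s, n), 3))
```

Because n enters through its square root, quadrupling the sample size only halves the standard error.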

Confidence Interval for Mean

•  We can also use the standard error and our knowledge of the normal distribution to construct a confidence interval around the mean: that is, the band of values within which the population mean, µ, is likely to reside.

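Confidence Interval for Mean

•  The confidence interval can be found with:

$$\bar{x} \pm z \, s_{\bar{x}} \quad \text{or} \quad \bar{x} \pm z \left( \frac{s}{\sqrt{n}} \right)$$

•  Where z is our critical value: i.e., the number of standard deviations (here, standard errors) away from the mean that bound the range within which we think the population mean resides.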

Wait… z-value? What’s that?

•  Think back to what the CLT tells us:
– About 68% of sample means fall within about one standard deviation on either side of the population mean.
– About 95% fall within about two.
– About 99.7% fall within about three.

•  We can use this information to find the standard deviations that correspond to the distribution percentiles that capture these percentages.

Wait… z-value? What’s that?

•  For example, though we say that 95% of the estimates fall within about two standard deviations of the population mean, the more precise number is 1.96. It is our z-value!

That is, about 95% of the sample means fall within ±1.96 standard deviations of the population mean. (Never mind that 0 in the photo; for our purposes, think of it as µ.)

*Photo from Wikipedia (https://en.wikipedia.org/wiki/1.96).
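Critical values such as 1.96 do not need to be memorized. They are percentiles of the standard normal distribution. A minimal Python sketch, assuming a two-sided interval (the helper name `critical_z` is illustrative):

```python
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical value: the z that leaves (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.68, 0.95, 0.99):
    print(level, round(critical_z(level), 2))
```

For 95% confidence, 2.5% is left in each tail, so the critical value is the 97.5th percentile of the standard normal curve.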

Confidence Interval Example

•  The mean weight (in kilograms) in our NHANES sample is 71.90. The standard deviation is 15.36, and our sample size is 10,337. What range of kilograms can we be 95% confident includes the population mean?

. sum weight

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |      10337    71.90088    15.35515      30.84     175.88

Confidence Interval Example

•  Let’s start by first computing the standard error:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{15.35515}{\sqrt{10337}} \approx 0.151$$

Confidence Interval Example

•  We want to capture the population mean within the band of values that, according to the CLT, likely fall within ±1.96 standard deviations from the population mean. As such:

$$\bar{x} \pm z \left( \frac{s}{\sqrt{n}} \right) = 71.90088 \pm 1.96 \left( \frac{15.35515}{\sqrt{10337}} \right) \approx 71.901 \pm 0.296$$

Confidence Interval Example

•  We can be 95% confident that the interval between 71.605 kg and 72.197 kg contains the mean population weight. That is, if we drew many random samples and built an interval this way each time, about 95% of those intervals would contain µ.

Confidence Interval Example

•  Confirm with Stata:

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      weight |      10337    71.90088    .1510277        71.60484    72.19692
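The hand calculation for the weight example can be wrapped in a small function. This is a sketch under the normal approximation used in these slides. The function name and arguments are illustrative, and the inputs are the summary statistics from `sum weight`.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    margin = z * sd / math.sqrt(n)                      # z times the standard error
    return mean - margin, mean + margin

lower, upper = mean_confidence_interval(mean=71.90088, sd=15.35515, n=10337)
print(round(lower, 3), round(upper, 3))
```

The bounds land within a fraction of a gram of the Stata output. The tiny remaining gap is consistent with Stata using a t-based critical value rather than exactly 1.96, which makes almost no difference at this sample size.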

CI example with different critical value

•  What if we wanted to be, say, 99% confident that our interval contains µ?

•  The critical z-value for a 99% confidence interval is 2.58. This means that, following the CLT, we expect about 99% of sample means pulled randomly from our sampling distribution to fall within ±2.58 standard deviations of the population mean.

CI example with different critical value

•  Let’s do the math:

$$\bar{x} \pm z \left( \frac{s}{\sqrt{n}} \right) = 71.90088 \pm 2.58 \left( \frac{15.35515}{\sqrt{10337}} \right) \approx 71.901 \pm 0.390$$

CI example with different critical value

•  We can say that, 99 times out of 100, an interval built this way captures the mean population weight. Here, that interval runs from 71.51 kg to 72.29 kg.

CI example with different critical value

•  Confirm with Stata:

    Variable |        Obs        Mean    Std. Err.       [99% Conf. Interval]
-------------+---------------------------------------------------------------
      weight |      10337    71.90088    .1510277        71.51179    72.28997

Confidence Interval Precision

•  Note that the confidence interval gets wider when we go from 95% to 99% confidence.
– For 95% CI: 72.197 – 71.605 = .592
– For 99% CI: 72.290 – 71.512 = .778

•  This is because we have to give up precision when we try to be more confident!
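The trade-off between confidence and precision can be tabulated for any confidence level. The sketch below reuses the weight standard error and the normal approximation. The 90% level is added only for comparison and does not appear in the slides.

```python
import math
from statistics import NormalDist

se = 15.35515 / math.sqrt(10337)  # standard error of mean weight

def interval_width(confidence):
    """Full width of the normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * se

for level in (0.90, 0.95, 0.99):
    print(level, round(interval_width(level), 3))
```

The standard error stays fixed, so the interval widens only because a higher confidence level demands a larger critical value.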

z-scores

•  Recall that critical z-values (e.g., ±1.96 and ±2.58) are the numbers of standard deviations away from µ within which we would expect to capture 95% and 99% of random sample means (respectively) in a normal sampling distribution.

•  Theoretically, the z-value for any given case (here, a sample mean) can be calculated with $z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$. This value would tell us how many standard deviations the case is from µ.

•  We can apply this same logic to an individual variable to quantify how far a specific case is from the variable mean. This measure is called a z-score, or a standardized score.

z-scores

•  The z-score for an individual case can be found by subtracting the variable mean from the raw score and then dividing the difference by the variable standard deviation:

$$z_i = \frac{x_i - \bar{x}}{s}$$

– Where, as before, $\bar{x}$ is the mean for the variable and s is the variable standard deviation.

z-score example

•  Below is the distribution of systolic blood pressure readings for the NHANES sample. The mean is 130.88 mmHg. The green bar is a particular value of the variable: 110 mmHg. The standard deviation is 23.34 mmHg. What is the z-score for this case, and what does this number mean?

[Figure: histogram of systolic blood pressure with the bar at 110 highlighted in green. x-axis: systolic blood pressure (50 to 290); y-axis: frequency (0 to 1,000).]

z-score example

$$z = \frac{110 - 130.88}{23.34} \approx -0.89$$

•  A case with a systolic blood pressure reading of 110 mmHg is a little less than 1 standard deviation below the mean.
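The z-score for the 110 mmHg reading can be computed directly from the mean and standard deviation given in the example. A minimal Python sketch (the function name `z_score` is illustrative):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Mean and standard deviation of systolic blood pressure from the example.
z = z_score(x=110, mean=130.88, sd=23.34)
print(round(z, 2))
```

The negative sign indicates that the reading is below the mean, and the magnitude of about 0.89 indicates that it is a bit less than one standard deviation away.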

z-score example

•  Confirming with Stata. Note that the same bar is colored green. That’s because they are the same cases!

[Figure: two histograms with the same bar highlighted in green. First panel x-axis: systolic blood pressure (50 to 290); second panel x-axis: zsystol (−2 to 8); y-axis in both: frequency (0 to 1,000).]
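Standardizing a whole variable, as with zsystol, applies the z-score formula to every case at once. The sketch below uses a handful of made-up readings (not NHANES data) and assumes the sample standard deviation with an n − 1 denominator.

```python
import statistics

def standardize(values):
    """Return z-scores using the sample mean and sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # n - 1 denominator
    return [(v - mean) / sd for v in values]

# Hypothetical readings in mmHg, made up for illustration.
readings = [110, 118, 125, 131, 140, 152, 170]
z_values = standardize(readings)

print([round(z, 2) for z in z_values])
print(round(statistics.fmean(z_values), 6), round(statistics.stdev(z_values), 6))
```

Standardizing shifts the mean to 0 and rescales the standard deviation to 1, but it does not reorder the cases or change the shape of the distribution, which is why the same bar stays highlighted in both histograms.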

Conclusion

•  We have seen how the asymptotic theory of probability distributions allows us to assess how well our single sample mean represents the true population mean in the absence of repeated random samples.

•  It does this by following the CLT. This allows us to use our sample size to estimate how well hypothetical random samples (of the same size) would approximate a normal distribution and therefore approximate the population mean.

Conclusion

•  Though standard errors and confidence intervals help us get an idea of where the population mean may be, how do we know these numbers are reliable? That is, how do we know that our estimates aren’t just the product of sampling error?

•  This is the job of statistical inference, and it is the topic for the next session!

Datasets Used

•  The Stata survey documentation data, nhanes2f, from the Stata Press website. Retrieved July 24, 2016 (http://www.stata-press.com/data/r11/svy.html).
