SticiGui Confidence Intervals

Chapter 26

Confidence Intervals

This chapter continues our study of estimating population parameters from random samples. In Chapter 25, Estimating Parameters from Simple Random Samples, we studied estimators that assign a number to each possible random sample, and the uncertainty of such estimators, measured by their RMSE. (The RMSE is the square root of the expected value of the squared difference between the estimator and the parameter, a measure of the typical size of the error.) Instead of assigning a single number to each sample and reporting the size of a typical error, the methods in this chapter assign an interval to each sample and report the confidence level that the interval contains the parameter. Confidence is a technical term related to probability. Just as the RMSE of an estimator measures the long-run average size of the error in repeated sampling, but the error for any particular sample could be smaller or larger than the RMSE, the confidence level is the long-run fraction of intervals that contain the parameter in repeated sampling, but the interval for any particular sample might or might not contain the parameter.

The statement "the interval [92%, 94%] contains the population percentage at confidence level 90%" does not mean that the probability that the population percentage is between 92% and 94% is 90%. (The event that the interval [92%, 94%] contains the population percentage is not random: Either the population percentage is between 92% and 94%, or it is not.) Rather, the statement means that if we were to take samples of size n repeatedly and compute a 90% confidence interval for the population percentage from each sample of size n, the long-run fraction of intervals that contain the population percentage would converge to 90%.

The length of the confidence interval and the confidence level measure how accurately we are able to estimate the parameter from a sample. If a short interval has high confidence, the data allow us to estimate the parameter accurately. Higher confidence generally requires a longer interval, ceteris paribus, and shorter intervals generally have lower confidence levels. Conventional values for the confidence level of confidence intervals include 68%, 90%, 95%, and 99%, but sometimes other values are used. It is crucial to know the confidence level associated with a confidence interval: The interval by itself is meaningless.

Conservative confidence intervals for percentages

In this section, we develop conservative confidence intervals for the population percentage based on the sample percentage, using Chebychev's inequality and an upper bound on the SD of lists that contain only the numbers 0 and 1. Conservative means that the chance that the procedure produces an interval that contains the population percentage is at least as large as claimed. (Later in this chapter we will consider approximate confidence intervals.)

Consider a 0-1 box of N tickets. The population percentage p is the fraction of tickets labeled "1":

$$p = 100\% \times (\#\,\text{tickets in the population labeled "1"})/N.$$

The population percentage is also the population mean of the numbers on all the tickets in the box, ave(box). The sample percentage $\phi$ of a simple random sample (random sample without replacement) of size n from the population of N tickets is

$$\phi = 100\% \times (\#\,\text{tickets in the sample labeled "1"})/n.$$

The sample percentage is the sample mean of the labels on the tickets in the sample. The expected value of the sample percentage is the population percentage p, and the SE of the sample percentage is

$$SE(\phi) = f \times \sqrt{p(1-p)}/\sqrt{n} \le f \times 50\%/\sqrt{n},$$

where f is the finite population correction

$$f = \sqrt{(N-n)/(N-1)}.$$

Thus $f \times 50\%/\sqrt{n}$ is an upper bound on the SE of the sample percentage.
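As an illustrative sketch (the function names are hypothetical and not part of the original text), the SE of the sample percentage and its upper bound could be computed as follows. The sketch assumes p is given as a fraction between 0 and 1, so the bound of 50% appears as 0.5.

```python
import math

def finite_population_correction(N, n):
    """f = sqrt((N - n)/(N - 1)) for a simple random sample of size n from N tickets."""
    return math.sqrt((N - n) / (N - 1))

def se_sample_percentage(p, N, n):
    """SE of the sample percentage when the population fraction of ones is p."""
    return finite_population_correction(N, n) * math.sqrt(p * (1 - p)) / math.sqrt(n)

def se_upper_bound(N, n):
    """Upper bound on the SE, valid for every p, since sqrt(p(1-p)) <= 0.5."""
    return finite_population_correction(N, n) * 0.5 / math.sqrt(n)

# The true SE never exceeds the bound, and the two agree when p = 0.5.
for p in (0.1, 0.3, 0.5, 0.9):
    print(p, se_sample_percentage(p, N=1000, n=50), se_upper_bound(N=1000, n=50))
```

The bound is useful precisely because it does not require knowing p.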

Figure 26-1 shows what happens if we center an interval at the sample percentage, and extend the interval down and up from the sample percentage by twice the upper bound on the SE of the sample percentage. When the interval includes the population percentage, we say the interval covers the truth. The interval is random, because it is centered at the sample percentage, which is random. The chance that the random interval will contain the true population percentage is called the coverage probability of the interval. Take a few samples by clicking Take Sample to get the feel of the tool, then increase Samples to Take to 1000 and click Take Sample again. The actual percentage of intervals that cover will vary, but almost always it will be larger than 75%, sometimes nearly 100%. The empirical percentage of intervals that cover is an estimate of the coverage probability of the procedure. Vary the sample size and put a few different lists of zeros and ones into the Population box at the right of the figure, and try a few different sample sizes for each population. You should find that the fraction of intervals that cover the true population percentage stays above 75% (almost without fail), no matter what the population of zeros and ones is.

Figure 26-1: Conservative Confidence Interval for the Population Percentage

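Why do these random intervals cover the true population percentage so often? We can show that they should using Chebychev's inequality. Write $\phi$ for the sample percentage. Because

$$SE(\phi) \le f \times 50\%/\sqrt{n},$$

the event

$$|\phi - p| \le k \times SE(\phi)$$

is a subset of the event

$$|\phi - p| \le k f \times 50\%/\sqrt{n}.$$

It follows that

$$P(|\phi - p| \le k \times SE(\phi)) \le P(|\phi - p| \le k f \times 50\%/\sqrt{n}).$$

Chebychev's inequality guarantees that the chance the sample percentage differs from its expected value p by more than k times its standard error is at most $1/k^2$, so

$$1 - 1/k^2 \le P(|\phi - p| \le k \times SE(\phi)) \le P(|\phi - p| \le k f \times 50\%/\sqrt{n}).$$

That is,

$$P(|\phi - p| \le k f \times 50\%/\sqrt{n}) \ge 1 - 1/k^2.$$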

Therefore, in the long run in repeated sampling, the fraction of trials in which the sample percentage $\phi$ is within $2f \times 50\%/\sqrt{n}$ of the population percentage p converges to a number that is 75% or larger. Whenever $\phi$ is within $2f \times 50\%/\sqrt{n}$ of the population percentage p, an interval centered at $\phi$ extending down and up by $2f \times 50\%/\sqrt{n}$ will contain p. That is, the interval

$$\phi \pm 2f \times 50\%/\sqrt{n},$$

which is shorthand for

$$[\phi - 2f \times 50\%/\sqrt{n},\ \phi + 2f \times 50\%/\sqrt{n}],$$

contains p at least 75% of the time, in the long run. Similarly, the fraction of trials in which $\phi$ is within $3f \times 50\%/\sqrt{n}$ of p converges to a number that is 88.89% or larger, so the long-run fraction of intervals $\phi \pm 3f \times 50\%/\sqrt{n}$ that contain p will be 88.89% or larger. The fraction of trials in which $\phi$ is within $4f \times 50\%/\sqrt{n}$ of p converges to a number that is 93.75% or larger, so the long-run fraction of intervals $\phi \pm 4f \times 50\%/\sqrt{n}$ that contain p will be 93.75% or larger, etc.

In general, if we go down and up from the sample percentage by $k f \times 50\%/\sqrt{n}$, then in the long run in repeated trials, the resulting intervals will include the true population percentage at least $1 - 1/k^2$ of the time.

Change the Intervals: value in Figure 26-1 to 3 and to 4 to confirm empirically that this is true.
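The same check can be run outside the interactive figure. The following is a minimal simulation sketch, not the tool used on the original page: the function name, the number of trials, and the seed are assumptions chosen for illustration. It draws simple random samples from a 0-1 box and reports the fraction of intervals (sample percentage) ± k × f × 0.5/√n that cover the population percentage, working in fractions rather than percent.

```python
import math
import random

def conservative_coverage(box, n, k, trials=2000, seed=0):
    """Estimate the coverage of the conservative interval by repeated sampling."""
    rng = random.Random(seed)               # fixed seed so the run is repeatable
    N = len(box)
    p = sum(box) / N                        # true population fraction of ones
    f = math.sqrt((N - n) / (N - 1))        # finite population correction
    half_width = k * f * 0.5 / math.sqrt(n) # k times the bound on the SE
    covered = 0
    for _ in range(trials):
        sample = rng.sample(box, n)         # simple random sample (no replacement)
        phi = sum(sample) / n               # sample fraction of ones
        if abs(phi - p) <= half_width:
            covered += 1
    return covered / trials

# Example: a small box, samples of size 3, intervals of +/- 2 times the bound.
print(conservative_coverage([1, 1, 0, 0, 0, 1], n=3, k=2))
```

For k = 2 the estimated coverage should be at least roughly 75%, and typically well above it, which is what "conservative" means here.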

The interval $\phi \pm k f \times 50\%/\sqrt{n}$ is random: Its center depends on the sample percentage $\phi$, which in turn depends on which units (here, tickets) happen to be in the random sample. The probability is in the random sampling procedure, not in the parameter. The parameter is the same, no matter what sample we happen to get; the parameter is a property of the population, not the sample. It is the interval that varies with the random sample. Before the data are collected, the coverage probability is the chance that sampling will result in an interval that contains the parameter.

Taking the sample determines the interval, leaving nothing to chance: The interval the procedure produced either does or does not contain the population percentage. (One could say that after collecting the data, the chance that the interval covers the parameter is either 0 or 100%.) Typically, we never learn whether the interval covers the parameter, but our ignorance is not a probability (at least, not according to the frequency theory of probability used in this book).

The interval the procedure gives for any particular set of data is called a confidence interval. The confidence level of a confidence interval is equal to the coverage probability of the procedure before the data are collected.

Confidence is a word statisticians reserve for this idea. If, before collecting the data, the procedure we are using has a P% chance of producing an interval that covers the true population percentage, then, after collecting the data, the interval the procedure produced is called a P% confidence interval.

Coverage Probability and Confidence Level

Consider a population parameter, and a procedure that produces random intervals. Suppose that the probability that the procedure produces an interval that contains the parameter is P%.

1. The procedure is said to have coverage probability P%.
2. The interval the procedure produces for any particular sample is called a P% confidence interval for the parameter, or a confidence interval for the parameter with confidence level P%.

In repeated sampling, about P% of confidence intervals with confidence level P% will contain (cover) the parameter. About (100 − P)% of the intervals will not cover the parameter. For any particular sample, unless the population parameter is known, we will not know whether the confidence interval covers the parameter.

Chapter 25, Estimating Parameters from Simple Random Samples, summarized the uncertainty of an estimate of a parameter by the mean squared error or root mean squared error of the estimator, which are measures of the average error of the estimator in repeated sampling. A confidence interval is a different way of expressing the uncertainty in an estimate: a range of values that contains the parameter with specified confidence level.

The interpretation of confidence level for a particular interval is analogous to the interpretation of RMSE for a particular value of the estimate: The RMSE is the square root of the long-run average squared error of the estimator in repeated sampling, but for any particular sample, the error could be larger or smaller than the RMSE, and we will not know which unless we know the true value of the parameter. The confidence level measures the long-run fraction of intervals that contain the parameter in repeated sampling, but for any particular sample, the confidence interval either will or will not contain the parameter, and we will not know which unless we know the true value of the parameter.

We can use the approach developed in this section to construct confidence intervals for the population percentage p with other nominal confidence levels, by extending the interval up and down from the sample percentage $\phi$ by larger or smaller amounts. The longer the intervals, the larger the nominal confidence level: the larger the chance that an interval will contain p. The shorter the intervals, the smaller the chance that an interval will contain p. In particular, if we choose k so that

$$1 - 1/k^2 = P\%,$$

then the interval

$$[\phi - k f \times 50\%/\sqrt{n},\ \phi + k f \times 50\%/\sqrt{n}]$$

is a (nominal) P% confidence interval for the population percentage p.

Conversely, to get a nominal P% conservative confidence interval for the population percentage using a simple random sample, we should take an interval that extends down and up from the sample percentage by $k f \times 50\%/\sqrt{n}$, with

$$k = 1/\sqrt{1 - P/100}.$$
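The recipe can be sketched in a few lines of Python. The function names and the numbers in the example are hypothetical, chosen only for illustration; percentages are kept in percent, so the bound on the SD of a 0-1 box is 50.

```python
import math

def chebychev_k(P):
    """Multiplier k that solves 1 - 1/k^2 = P/100 for a nominal level of P percent."""
    return 1 / math.sqrt(1 - P / 100)

def conservative_interval(phi, n, N, P):
    """Nominal P% conservative interval for the population percentage.

    phi is the sample percentage (in percent), n the sample size,
    N the population size.
    """
    k = chebychev_k(P)
    f = math.sqrt((N - n) / (N - 1))   # finite population correction
    half_width = k * f * 50 / math.sqrt(n)
    return phi - half_width, phi + half_width

print(chebychev_k(75))                               # k for a 75% nominal level
print(conservative_interval(40, n=100, N=10000, P=75))
```

Note how quickly k grows with the nominal level: a 75% level needs k = 2, but a 93.75% level already needs k = 4, doubling the length of the interval.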

The actual coverage probability of the interval

$$[\phi - k f \times 50\%/\sqrt{n},\ \phi + k f \times 50\%/\sqrt{n}]$$

centered at the sample percentage $\phi$ is greater than $(1 - 1/k^2)$, for two reasons. First, the standard error of the sample percentage is less than $f \times 50\%/\sqrt{n}$ unless the population percentage p is 50%. Second, the distribution of the sample percentage is that of a hypergeometric random variable divided by the sample size, n, and such a distribution cannot attain the bound in Chebychev's inequality: Even for the true SE of the sample percentage,

$$SE(\phi) = f \times \sqrt{p(1-p)}/\sqrt{n},$$

the chance that the sample percentage is within $k \times SE(\phi)$ of the population percentage p is greater than $1 - 1/k^2$:

$$P(|\phi - p| \le k \times SE(\phi)) > 1 - 1/k^2.$$

As a result, confidence intervals for the population percentage based on Chebychev's inequality and the upper bound of 50% for the SD of a list of zeros and ones are conservative: the actual confidence level is greater than the nominal confidence level, $(1 - 1/k^2)$. A later section develops a procedure that is not conservative, but that is approximate: The confidence level could be larger or smaller than the nominal level. (The nominal confidence level is close to the actual confidence level when the sample size n is large.)

A population percentage cannot be less than 0%. If the lower endpoint of a confidence interval for a population percentage is negative, it is completely legitimate to replace the lower endpoint by zero: It does not decrease the confidence level. Similarly, a population percentage cannot be greater than 100%. If the upper endpoint of a confidence interval for a population percentage is greater than 100%, it is legitimate to replace the upper endpoint by 100%. The confidence level remains the same. Similarly, if we are constructing a confidence interval for a quantity that cannot be negative (height, weight, or age, for instance), removing negative values from a confidence interval cannot reduce the coverage probability or confidence level.

Confidence Intervals for Restricted Parameters

If some values of a parameter are known to be impossible, excluding those values from a confidence interval does not reduce the confidence level of the confidence interval.

Conversely, including impossible values of a parameter in a confidence interval does not increase the confidence level.

For example, if a confidence interval for a parameter that must be positive has a lower endpoint that is negative, the lower endpoint can be replaced with zero. The confidence level remains the same.

In particular, if the lower endpoint of a confidence interval for a population percentage is negative, the lower endpoint can be replaced with zero. If the upper endpoint of a confidence interval for a population percentage is greater than 100%, the upper endpoint can be replaced with 100%.
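In code, restricting an interval to the possible values of the parameter is a one-line operation. The helper below is a hypothetical illustration; its default bounds assume the parameter is a percentage.

```python
def clip_interval(lower, upper, lowest_possible=0.0, highest_possible=100.0):
    """Drop impossible values from a confidence interval for a restricted parameter."""
    return max(lower, lowest_possible), min(upper, highest_possible)

print(clip_interval(-5.0, 30.0))    # negative lower endpoint is replaced by 0
print(clip_interval(80.0, 112.5))   # upper endpoint above 100% is replaced by 100
```

Clipping changes only the reported endpoints; the confidence level of the interval is unchanged.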

Whenever you use a confidence interval, it is crucial to report the confidence level. Otherwise, it is impossible to interpret the result. The choice of the confidence level is essentially arbitrary, but the choice should be made before collecting the data. Common values of the confidence level are 68%, 90%, 95%, and 99%. There is a tradeoff between precision (the length of the confidence interval) and confidence level: Ceteris paribus, higher confidence levels require longer confidence intervals.

The following exercise checks your ability to compute a conservative confidence interval for the population percentage.

Exercise 26-1. The entering class at North Southcentral University contains 600 students. The dean's office seeks to determine the percentage of entering students who have credit cards. The dean's office will take a simple random sample of 40 entering students, interview them, and compute the sample percentage. The office would like to construct a conservative 75% confidence interval for the percentage of entering students who have credit cards. The center of the interval will be the sample percentage.

The interval should extend up and down from the sample percentage by ____

The sample is taken, and the sample percentage is observed to be 86%.

The lower endpoint of the confidence interval should be ____ and the upper endpoint should be ____

The probability that this interval contains the percentage of students in the entering class who have credit cards [?]

The confidence level of this interval [?]

Conservative confidence intervals for population means of bounded boxes

Recall that percentages are just means of special lists of numbers, lists that contain only zeros and ones. We can find confidence intervals for the means of more general lists of numbers, too.

In the previous section we exploited the fact that the SD of a 0-1 box is at most 1/2 to construct conservative confidence intervals for the population mean of a 0-1 box, that is, the population percentage. The approach can be used not only for 0-1 boxes, but whenever we can find a bound on the SD of the box, so that we can apply Chebychev's inequality. For any box of numbered tickets whatsoever, the sample mean of a simple random sample or random sample with replacement is an unbiased estimator of the population mean of the numbers on the tickets, and the SE of the sample mean is proportional to the SD of the box.

For instance, suppose we know that the numbers on the tickets in the box are all between a and b, with $a \le b$. Then SD(box) is at most $(b-a)/2$. In the special case that a = 0 and b = 1, this implies that the SD of a 0-1 box is at most 50%, as we have seen already.
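One way to see why the bound on SD(box) holds is sketched below; this short argument is supplied here for illustration and is not quoted from the original footnote. Write $x_1, \ldots, x_N$ for the numbers in the box and $\bar{x}$ for their mean. The mean squared deviation from $\bar{x}$ is no larger than the mean squared deviation from any other constant, in particular from the midpoint $(a+b)/2$, and no ticket is farther than $(b-a)/2$ from that midpoint:

```latex
\mathrm{SD}(\text{box})^2
  = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2
  \le \frac{1}{N}\sum_{i=1}^{N} \Bigl(x_i - \tfrac{a+b}{2}\Bigr)^2
  \le \Bigl(\tfrac{b-a}{2}\Bigr)^2 .
```

Taking square roots gives $\mathrm{SD}(\text{box}) \le (b-a)/2$, with equality when half the tickets equal a and half equal b.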

That in turn implies that if all the numbers in a box are between a and b, the SE of the sample mean of a simple random sample of n draws from the box is at most $f \times (b-a)/(2\sqrt{n})$, where f is the finite population correction. And the SE of the sample mean of n draws with replacement from the box is at most $(b-a)/(2\sqrt{n})$.

Sampling from a Bounded Box

Suppose all the numbers in a box are between a and b, with $a \le b$. Then:

1. SD(box) is at most $(b-a)/2$.
2. The SE of the sample mean of n draws with replacement from the box is at most $(b-a)/(2\sqrt{n})$.
3. The SE of the sample mean of a simple random sample of size n from the box is at most $f \times (b-a)/(2\sqrt{n})$, where f is the finite population correction.

With a bound on the SE, we can use Chebychev's inequality the same way we did for the population percentage to get a confidence interval for the population mean of the numbers on the tickets in a box:

Conservative Confidence Intervals for the Population Mean of a Bounded List

Suppose all the numbers in a box are between a and b, where $a \le b$.

For a simple random sample of size n, the chance that the random interval

$$[(\text{sample mean}) - k f \times (b-a)/(2\sqrt{n}),\ (\text{sample mean}) + k f \times (b-a)/(2\sqrt{n})]$$

includes the mean of the numbers in the box is at least $1 - 1/k^2$, where f is the finite population correction $\sqrt{(N-n)/(N-1)}$, N is the population size, and n is the sample size.

For random sampling with replacement, the chance that the random interval

$$[(\text{sample mean}) - k \times (b-a)/(2\sqrt{n}),\ (\text{sample mean}) + k \times (b-a)/(2\sqrt{n})]$$

includes the mean of the numbers in the box is at least $1 - 1/k^2$.

In both cases, if the lower endpoint of the interval is less than a, it can be replaced by a, and if the upper endpoint of the interval is greater than b, it can be replaced by b.

These are conservative procedures for constructing confidence intervals: the probability that the intervals they produce cover the true population mean is greater than the probability they claim, $1 - 1/k^2$ (the nominal coverage probability).
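The boxed procedure translates directly into code. The sketch below is illustrative only: the function name and the convention that N=None means sampling with replacement are assumptions made for this example.

```python
import math

def bounded_mean_interval(sample, a, b, k, N=None):
    """Conservative interval for the mean of a box whose numbers lie in [a, b].

    Pass the population size N for a simple random sample; leave N as None
    for a random sample with replacement (no finite population correction).
    The nominal coverage probability is 1 - 1/k**2.
    """
    n = len(sample)
    mean = sum(sample) / n
    f = 1.0 if N is None else math.sqrt((N - n) / (N - 1))
    half_width = k * f * (b - a) / (2 * math.sqrt(n))
    # Endpoints outside [a, b] are impossible values, so clip them.
    return max(mean - half_width, a), min(mean + half_width, b)

sample = [3, 5, 4, 6] * 4          # 16 made-up values known to lie in [0, 10]
print(bounded_mean_interval(sample, a=0, b=10, k=2))
```

With 16 draws from a box bounded by 0 and 10, the half-width for k = 2 is 2 × 10/(2 × 4) = 2.5, so the printed interval runs from 2.0 to 7.0.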

Approximate confidence intervals for percentages

Confidence intervals for the population percentage based on Chebychev's inequality and the upper bound of 50% for the SD of lists of zeros and ones are conservative: Their true confidence level is greater than their nominal confidence level, $(1 - 1/k^2)$. We could use shorter intervals and still have confidence level $(1 - 1/k^2)$, or we could claim a confidence level higher than $(1 - 1/k^2)$.

How much shorter could the interval be, or how large a confidence level could we claim? It is possible to figure these things out precisely, but we shall follow a standard approximate approach instead, one that we can extend to other situations. We shall use the Central Limit Theorem to develop a procedure that produces shorter confidence intervals for a given nominal confidence level. The new procedure will be approximate instead of conservative: the coverage probability will be close to the nominal coverage probability when the sample size is large, but could be smaller or larger depending on the population percentage, and could be quite different from the nominal coverage probability for small samples from pathological populations.

We shall assume throughout the rest of this chapter that either the sample is drawn with replacement, or the sample size n is much, much smaller than the population size N.

With this assumption, we can neglect the finite population correction and act as if the tickets in the sample were drawn independently. (See Chapter 22, Standard Error.) When the tickets are drawn independently, the Central Limit Theorem tells us that as the sample size grows, the normal curve is a better and better approximation to the probability histogram of the sample percentage (and to the probability histogram of the sample mean). The normal approximation to the probability that the sample percentage is in the interval

$$[p - 1.15\sqrt{p(1-p)}/\sqrt{n},\ p + 1.15\sqrt{p(1-p)}/\sqrt{n}]$$

is equal to the area under the normal curve for the corresponding range of values in standard units, [−1.15, 1.15]. The area under the normal curve between −1.15 and 1.15 is about 75% (the normal curve tool reports a selected area of 74.99% for lower endpoint −1.15 and upper endpoint 1.15).

This is much larger than the bound of $(1 - 1/(1.15)^2) = 24.4\%$ that Chebychev's inequality gives. When the sample percentage $\phi$ is within

$$1.15\sqrt{p(1-p)}/\sqrt{n}$$

of p, p is within

$$1.15\sqrt{p(1-p)}/\sqrt{n}$$

of the sample percentage, so the probability that the interval

$$I = [\phi - 1.15\sqrt{p(1-p)}/\sqrt{n},\ \phi + 1.15\sqrt{p(1-p)}/\sqrt{n}]$$

contains the population percentage p is about 75%: The coverage probability of I is approximately 75%.
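The two numbers compared above, the area under the normal curve between −1.15 and 1.15 and the Chebychev bound for k = 1.15, can be reproduced with the Python standard library. The helper names are hypothetical.

```python
from statistics import NormalDist

def normal_area(lower, upper):
    """Area under the standard normal curve between lower and upper."""
    z = NormalDist()                 # mean 0, SD 1
    return z.cdf(upper) - z.cdf(lower)

def chebychev_bound(k):
    """Chebychev's lower bound on the chance of being within k SEs."""
    return 1 - 1 / k**2

print(round(100 * normal_area(-1.15, 1.15), 2))   # about 75 (percent)
print(round(100 * chebychev_bound(1.15), 1))      # about 24.4 (percent)
```

The gap between the two is the price Chebychev's inequality pays for making no assumption about the shape of the distribution.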

Unfortunately, we cannot construct I from the sample alone: the sample determines the center of I, but to find the length of I we need to know $\sqrt{p(1-p)}$, which is tantamount to knowing p. If we knew p, we would not be estimating it.

If the sample size n is large, the sample standard deviation s,

$$s = \sqrt{(n/(n-1)) \times \phi(1-\phi)},$$

where $\phi$ is the sample percentage, is likely to be close to the SD of the population; when that happens,

$$s/\sqrt{n}$$

is close to $SE(\phi)$, the standard error of the sample percentage. Therefore, if the sample size is large, but either the sample is small compared to the population or the sample is taken with replacement, the probability that the random interval

$$[\phi - 1.15\,s/\sqrt{n},\ \phi + 1.15\,s/\sqrt{n}]$$

contains the population percentage p is about 75%. This interval has not only a random center (the sample percentage), but also a random length (the length depends on the observed value of s, and s is random, because it depends on the random sample).
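A minimal sketch of this approximate procedure follows. The function name and the example counts are made up for illustration, and the multiplier is obtained from the normal curve rather than typed in, so a 75% level gives roughly 1.15. Everything is in fractions rather than percent.

```python
import math
from statistics import NormalDist

def approx_percentage_interval(x, n, level):
    """Approximate interval for the population fraction of ones.

    x is the number of ones among n independent draws; level is the
    nominal confidence level as a fraction (for example 0.75).
    """
    phi = x / n                                       # sample fraction
    s = math.sqrt(n / (n - 1) * phi * (1 - phi))      # sample SD of the 0/1 data
    z = NormalDist().inv_cdf(0.5 + level / 2)         # about 1.15 when level = 0.75
    half_width = z * s / math.sqrt(n)                 # z times the estimated SE
    return phi - half_width, phi + half_width

print(approx_percentage_interval(12, 30, 0.75))
```

Unlike the conservative interval, both the center and the length here depend on the sample, and the stated level is only approximate.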

Figure 26-2 lets you try the procedure yourself. Each time you click the Take Sample button, a sample is drawn with replacement from the numbers in the box on the right (initially set to a random list of zeros and ones). The sample size initially is set to 30. The controls at the bottom of the figure allow you to change the size of each sample, the number of samples that are taken each time you click the button, and the width of the interval, as a multiple of the estimated SE or the conservative bound on the SE. (The estimated SE is $s/\sqrt{n}$ because we are sampling with replacement; the bound is $0.5/\sqrt{n}$.) A label in the bottom right corner reports the fraction of intervals that cover the population percentage. Intervals that cover are green; those that do not cover are red. A small black dot marks the middle of each interval (the sample percentage). A blue vertical line marks the true population percentage p.

Figure 26-2: Approximate confidence intervals for the population mean and percentage


Take a few samples to get the feel of the tool, then increase the Samples to take to 1000, and click the Take Sample button again. The actual percentage of intervals that cover will vary, but should be reasonably close to 75%. Increase Sample size to 200 and try again; the percentage of intervals that cover should be closer to 75%. Try putting a few different lists of zeros and ones into the Population box at the right of the figure, and try a few different sample sizes for each population. When the sample size is large, the fraction of intervals that cover the true population percentage will be very close to 75%.

The following exercises check your ability to compute conservative and approximate confidence intervals for the population percentage, and your ability to determine which method is more appropriate.


Exercise 26-2. I would like to know the fraction of UC Berkeley undergraduates who commute to school from their parents' homes. I send email to students with campus computer accounts until 100 have responded; 3 of the responders were commuters.

An approximate 95% confidence interval for the fraction of UC Berkeley undergraduates who commute to school is from ____ to ____

Exercise 26-3. I would like to know the fraction of homes in Alameda County, California, that have assessed values of $700,000 or more. I take a simple random sample of size 500 from the Alameda County property tax records (somehow). The sample percentage of homes assessed at $700,000 or more is 10%.

An approximate 96% confidence interval for the percentage of homes assessed at $700,000 or more is from ____ to ____

Exercise 26-4. A random sample with replacement of size 20 was taken from a box of tickets. Each ticket in the box is numbered either zero or one. Six of the tickets in the sample are labeled "1"; the rest are labeled "0."

The sample percentage is ____

The sample size [?] large enough to justify assuming that s is close to SD(box) and using a confidence interval based on the normal distribution.

The SE of the sample percentage is [?]

A conservative 70% confidence interval for the population percentage is from ____ to ____

Exercise 26-5. A restaurateur plans to change the menu in her restaurant, which specializes in game meats. She is trying to decide whether or not to offer venison goulash on the new menu. Each day for a month, she picks people at random as they come into the restaurant, and asks them whether they would order venison goulash if it were offered. On busy days, she picks more people; on quiet days, she picks fewer people. Suppose that in effect she has a simple random sample of 160 people who eat at her restaurant. Suppose further that the number of diners is much, much larger than the sample. In the sample, 118 say they would order venison goulash if it were offered.

The sample percentage of diners who say they would order venison goulash is ____

The bootstrap estimate of the population standard deviation is ____

A 98% confidence interval for the percentage of diners who would say they would order venison goulash among the population of people who eat at that restaurant would go from ____ to ____

Approximate Confidence Intervals for the Population Mean

Suppose that we seek a confidence interval for the mean of a population (box) of numbers, based on a random sample from the population. The sample mean is an unbiased estimator of the population mean (E(sample mean) = ave(box)), so it is reasonable to center a confidence interval at the sample mean. How wide should we make an interval centered at the sample mean, for the interval to have a specified probability of covering the population mean?

If we knew the SD of the population or had an upper bound on the SD of the population, we could use Chebychev's inequality to construct a conservative confidence interval for the population mean, as we did earlier in the chapter: the standard error of the sample mean is

$$SE(\text{sample mean}) = SD(\text{box})/\sqrt{n},$$

where n is the sample size. So, for example, the coverage probability of the random interval

$$[(\text{sample mean}) - 2\,SD(\text{box})/\sqrt{n},\ (\text{sample mean}) + 2\,SD(\text{box})/\sqrt{n}]$$

is at least 75%.

Typically, however, the SD of the population is not known, so we cannot construct this interval. Moreover, typically we cannot use the conservative approach based on Chebychev's inequality, because there is no upper bound on the SD of a general list of numbers analogous to the upper bound of 50% for the SD of lists that contain only zeros and ones. (As we have seen, if all the numbers are bounded between a and b, with $a \le b$, then $SD(\text{box}) \le (b-a)/2$; but typically we do not know such lower and upper bounds a and b.)

However, the approximate approach to constructing confidence intervals, based on the normal curve, works if the sample size is sufficiently large. The Central Limit Theorem tells us that the probability histogram of the average of n draws with replacement from a box follows the normal curve increasingly well as the number of draws n increases. We also know that the sample standard deviation s is increasingly likely to be an accurate estimate of the SD of the population as n increases. As a result, the probability that the sample mean is within $z \times s/\sqrt{n}$ of the population mean is approximately the same as the area under the normal curve between −z and z. For any fixed population (box), the approximation improves as the sample size n increases, for random sampling with replacement. Example 26-1 illustrates calculating an approximate confidence interval for the population mean. The example is dynamic: It will tend to change when you reload the page.

Example 26-1: Approximate Confidence Interval for the Population Mean

To assess scholastic performance, a state administers an achievement test to a simple random sample of 160 high school seniors. There are 40,000 high school seniors in the state. The mean score of the students who took the exam is 104.34 points, and the sample standard deviation of their scores is 13.9 points. Find an approximate 98% confidence interval for the average of the population scores that would have been obtained had every high school senior in the state been administered the achievement test.

Solution. The sample size (160) is a sufficiently small fraction of the population size (40,000) that treating the sample as if it were drawn with replacement is reasonable. The sample size is sufficiently large that the normal approximation to the distribution of the sample mean should be reasonably accurate, and that the sample standard deviation should be close to the standard deviation of the population. The area under the normal curve between −2.326 and 2.326 is 98%.

Thus, an approximate 98% confidence interval would be centered at the sample mean, and extend down and up from the sample mean by 2.326 times the estimated standard error. The estimated standard error of the sample mean is

$$13.9/\sqrt{160} = 1.099 \text{ points}.$$

The confidence interval thus should extend down and up from the sample mean by

$$2.326 \times 1.099 = 2.556 \text{ points},$$

so the confidence interval is

[101.784 points, 106.896 points].
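The arithmetic in Example 26-1 can be checked with a short script. This is a sketch with a hypothetical function name; it uses the unrounded normal quantile instead of 2.326, so the endpoints agree with the worked example to about three decimal places rather than exactly.

```python
import math
from statistics import NormalDist

def approx_mean_interval(sample_mean, sample_sd, n, level):
    """Approximate confidence interval for a population mean (normal curve)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 2.326 for level = 0.98
    se = sample_sd / math.sqrt(n)               # estimated SE of the sample mean
    return sample_mean - z * se, sample_mean + z * se

low, high = approx_mean_interval(104.34, 13.9, 160, 0.98)
print(round(low, 2), round(high, 2))
```

The same function applies to any large sample drawn with replacement, or to a simple random sample that is a small fraction of the population.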

The following exercise checks your ability to calculate approximate confidence intervals for the population mean. The exercise is dynamic: The question will tend to change when you reload the page.

Exercise 26-6. To determine the average lifetime of their light-emitting diode (LED) light bulbs, a manufacturer takes a simple random sample of 110 bulbs from a manufacturing lot of 34,000 bulbs. The mean lifetime of the bulbs in the sample is 93.49 thousand hours, and the sample standard deviation of their lifetimes is 8.56 thousand hours.

An approximate 98% confidence interval for the average lifetime of the bulbs in the manufacturing lot would extend from ____ thousand hours (low) to ____ thousand hours (high).

Exact Confidence Intervals for Percentages

We have seen two methods for constructing confidence intervals for a population percentage: a conservative method based on Chebychev's inequality and a bound on SD(box), and an approximate method based on the normal approximation. Conservative means that the coverage probability is at least as high as claimed but could be substantially higher for some populations. Approximate means that the coverage probability is roughly as high as claimed but could be substantially lower (or substantially higher) for some populations. This section develops a third method, which is exact. Exact means that the probability that the random interval covers the true population percentage is just what it is claimed to be (in some cases it can be a bit higher, simply because the binomial distribution is a discrete distribution).

These intervals are rather different from the confidence intervals presented earlier in this chapter, which were of the form (estimate ± uncertainty). Instead, each of the endpoints is computed from the data, separately. The resulting interval usually is not symmetric around the sample percentage.

We assume that a sample of size n is drawn at random with replacement from a 0-1 box. We want to find a confidence interval for p, the percentage of tickets labeled "1" in the box. Let X be the number of tickets in the sample that are labeled "1." If the true percentage of tickets labeled "1" in the 0-1 box is p, then X has a binomial probability distribution with parameters n and p. We will construct a confidence interval for p by looking at the values of p that are plausible, given the observed value of X. The approach is similar to the approach we took in Chapter 19, Probability Meets Data, and very closely related to hypothesis testing, discussed in Chapter 27, Hypothesis Testing: Does Chance Explain the Results?

Suppose the observed value of X is x. If p were very, very small (close to zero), it would be unlikely to see x or more ones in the sample unless x = 0. So seeing x ones in the sample is evidence that p is not too small. Conversely, if p were very, very large (close to one), it would be unlikely to see x or fewer ones in the sample unless x = n. So observing that X = x limits the plausible range of values of p.

Suppose we want a confidence interval for p with confidence level $1 - \alpha$. Let $p^-$ be the smallest value of q for which

$$\alpha/2 \le P(X \ge x \text{ if } p = q) = {}_nC_x\, q^x (1-q)^{n-x} + {}_nC_{x+1}\, q^{x+1} (1-q)^{n-x-1} + \cdots + {}_nC_n\, q^n (1-q)^0.$$

Similarly, let $p^+$ be the largest value of q for which

$$\alpha/2 \le P(X \le x \text{ if } p = q) = {}_nC_x\, q^x (1-q)^{n-x} + {}_nC_{x-1}\, q^{x-1} (1-q)^{n-x+1} + \cdots + {}_nC_0\, q^0 (1-q)^n.$$

Then the interval $[p^-, p^+]$ is a $1 - \alpha$ confidence interval for p. Intervals constructed this way can be much shorter than the conservative intervals based on Chebychev's inequality and the upper bound on SD(box), but they are still guaranteed to attain at least their nominal confidence level. Confidence intervals based on the normal approximation are generally not much shorter, but their actual confidence level can be substantially lower than their nominal confidence level.
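The two endpoints can be found numerically. The sketch below uses bisection on q, which is one reasonable choice and not necessarily how the original page computes them; the function names and the tolerance are assumptions. It relies on the facts that P(X ≥ x) increases with q and P(X ≤ x) decreases with q. Here p is treated as a fraction between 0 and 1.

```python
from math import comb

def upper_tail(n, x, q):
    """P(X >= x) when X is binomial with parameters n and q."""
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(x, n + 1))

def lower_tail(n, x, q):
    """P(X <= x) when X is binomial with parameters n and q."""
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(0, x + 1))

def exact_interval(x, n, alpha, tol=1e-10):
    """Endpoints of the 1 - alpha interval for p given x ones in n draws."""
    # Lower endpoint: smallest q with P(X >= x | q) >= alpha/2.
    if x == 0:
        p_minus = 0.0                       # P(X >= 0) = 1 for every q
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if upper_tail(n, x, mid) >= alpha / 2:
                hi = mid                    # condition holds, try smaller q
            else:
                lo = mid
        p_minus = hi
    # Upper endpoint: largest q with P(X <= x | q) >= alpha/2.
    if x == n:
        p_plus = 1.0                        # P(X <= n) = 1 for every q
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if lower_tail(n, x, mid) >= alpha / 2:
                lo = mid                    # condition holds, try larger q
            else:
                hi = mid
        p_plus = lo
    return p_minus, p_plus

print(exact_interval(5, 10, 0.05))
```

As the text notes, the resulting interval is generally not symmetric about the sample percentage; it is symmetric about 50% in this example only because x = n/2.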

Confidence Intervals for Population Percentiles

We can also use a random sample with replacement to find a confidence interval for a percentile of a population. We shall work out the details for the median; other percentiles can be treated similarly. Unlike the conservative and approximate confidence intervals, and like the exact confidence intervals for the population percentage we just saw, these intervals are not of the form (estimate ± uncertainty). Instead, the endpoints of the intervals are two of the data. And this approach also leads to exact confidence intervals: The nominal coverage probability is equal to the actual coverage probability.

To begin, suppose we have a random sample of size 10

$$\{X_1, X_2, \ldots, X_{10}\}$$

taken with replacement from a population with median m. Sort the data into increasing order: let $X_{(1)}$ be the smallest datum, $X_{(2)}$ be the second smallest, etc., and let $X_{(10)}$ be the largest datum. (The sorted data are called the order statistics.) Let $A_1$ be the event that the fourth smallest datum, $X_{(4)}$, is less than or equal to the median, and let $A_2$ be the event that the seventh smallest datum, $X_{(7)}$, is greater than or equal to the median. The event $A_1$ occurs unless 7 or more data are greater than the population median, so $A_1^c$ is the event that 7 or more data are greater than the population median. Similarly, the event $A_2$ occurs unless 7 or more data are less than the population median, so $A_2^c$ is the event that 7 or more data are less than the population median. Let $A = A_1 \cap A_2$ be the event that the fourth and seventh order statistics bracket the median. We shall find a lower bound on the probability of A.

Note that if seven or more data are less than the median, then it is not the case that seven or more data are greater than the median, so $A_1^c$ and $A_2^c$ are disjoint. Hence,

$$P(A^c) = P((A_1 \cap A_2)^c) = P(A_1^c \cup A_2^c) = P(A_1^c) + P(A_2^c),$$

and thus

$$P(A) = 1 - P(A^c) = 1 - P(A_1^c) - P(A_2^c).$$

We are done if we can find upper bounds for $P(A_1^c)$ and $P(A_2^c)$.

Recall that the median is the smallest number that at least 50% of the population are less than or equal to. It follows that the probability that a number drawn at random from the population is strictly less than the median is at most 50% (and possibly less), and that the probability that a number drawn at random from the population is strictly greater than the median is at most 50% (and possibly less). The data are drawn from the population independently, so the number of data that are less than the population median has a binomial probability distribution with n trials and $p \le 50\%$, as does the number of data that are greater than the population median.

Let Y be a random variable with a binomial distribution with parameters n = 10 and p = 50%. Thus $P(A_1^c) \le P(Y \ge 7)$, and $P(A_2^c) \le P(Y \ge 7)$. However, $P(Y \ge 7) = P(Y \le 3)$, so

$$P(A) \ge 1 - P(Y \le 3 \text{ or } Y \ge 7) = P(4 \le Y \le 6).$$

Thus the probability that the interval $[X_{(4)}, X_{(7)}]$ contains the population median is at least as large as the probability of observing 4, 5, or 6 successes in 10 independent trials with probability 50% of success in each trial, the highlighted area in Figure 26-3:

Figure 26-3: Binomial probability histogram

The interval from the fourth smallest datum to the seventh smallest datum is therefore a 65.6% confidence interval for the population median.
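The 65.6% figure is a binomial probability that is easy to compute directly. The sketch below generalizes the argument to the interval from the j-th to the k-th order statistic of n data; that generalization and the function name are assumptions made for illustration, following the same reasoning as in the text, which gives the bound P(j ≤ Y ≤ k − 1) with Y binomial with parameters n and 50%.

```python
from math import comb

def median_interval_level(n, j, k):
    """Lower bound on the chance that [X_(j), X_(k)] covers the population median.

    Equals P(j <= Y <= k - 1) for Y binomial with n trials and p = 1/2,
    assuming 1 <= j <= k <= n.
    """
    return sum(comb(n, y) for y in range(j, k)) / 2**n

# Fourth to seventh order statistic of 10 data, as in the text.
print(median_interval_level(10, 4, 7))   # 0.65625, i.e. about 65.6%
```

Widening the interval, for example to the third and eighth order statistics, raises the confidence level at the cost of a longer interval.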

Thesameideacanbeusedtofindconfidenceintervalsforotherpercentiles:Theprobabilitydistributionofthenumberofdatathatarelessthanthe100qthpercentileisBinomialwithnumberoftrialsequaltothenumberofdata,n,andprobabilityofsuccessatmostq,andtheprobabilitydistributionofthenumberofdatathataregreaterthanthe100qthpercentileisBinomialwithnumberoftrialsequaltothenumberofdata,n,andprobabilityofsuccessatmost1q.

The following exercise checks whether you can find a confidence interval for a population median.

Exercise 26-7. Consider finding a 96.5% confidence interval for the median of a population from a random sample with replacement of size 15.

The confidence interval should go from the ? datum to the ?


Summary

Suppose we have a procedure for calculating an interval from every possible sample of size $n$ from a population of size $N$ (a box of $N$ numbered tickets). Let $t$ be a parameter of the population. Suppose that if the procedure is applied to a random sample of size $n$, the chance that the resulting interval will contain $t$ is $P\%$. Then the interval that results from applying the procedure to any particular random sample of size $n$ is a $P\%$ confidence interval for $t$. Once the random sample has been drawn, the resulting interval either covers (contains) or does not cover $t$; the probability that the interval covers $t$ is either 0 or 100%. The probability, before the sample is drawn, that the interval will cover $t$ is called the confidence level of the interval after the sample is drawn. Confidence intervals provide an alternative to reporting a single "best estimate" of a parameter and a summary measure of the uncertainty of the estimate.

It is possible to construct conservative confidence intervals for the population percentage from simple random samples or random samples with replacement from 0-1 boxes: For a simple random sample of size $n$, the chance that the random interval

$$\left[\hat{p} - \frac{kf}{2\sqrt{n}},\ \hat{p} + \frac{kf}{2\sqrt{n}}\right]$$

covers the population percentage $p$ is at least $1 - 1/k^2$, where $\hat{p}$ is the sample percentage, $f$ is the finite population correction $\sqrt{(N-n)/(N-1)}$, $N$ is the population size, and $n$ is the sample size. For random sampling with replacement, the chance that the random interval

$$\left[\hat{p} - \frac{k}{2\sqrt{n}},\ \hat{p} + \frac{k}{2\sqrt{n}}\right]$$

includes the population percentage $p$ is at least $1 - 1/k^2$. These are conservative procedures for constructing confidence intervals, because the probability that the intervals they produce cover the true population percentage $p$ (the actual coverage probability) is greater than the probability they claim, $1 - 1/k^2$ (the nominal coverage probability). These procedures can be extremely pessimistic, especially when the sample size $n$ is large and when the true population percentage $p$ is far from 50%: the intervals then are much wider than they need to be for the actual coverage probability to be $1 - 1/k^2$.
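As an illustration, the two conservative intervals can be computed with one small function. This is a sketch under stated assumptions: percentages are expressed as fractions between 0 and 1, the function name and its arguments are hypothetical, and the endpoints are not clipped to the range from 0 to 1.

```python
from math import sqrt


def conservative_interval(sample_pct, n, k, N=None):
    """Conservative confidence interval for a population percentage.

    sample_pct: sample percentage, as a fraction between 0 and 1.
    n: sample size.
    k: half-width multiplier; the nominal confidence level is 1 - 1/k**2.
    N: population size for a simple random sample; None means the sample
       was drawn with replacement, so no finite population correction.
    Returns (lower endpoint, upper endpoint, nominal confidence level).
    """
    f = 1.0 if N is None else sqrt((N - n) / (N - 1))
    half_width = k * f / (2 * sqrt(n))
    return sample_pct - half_width, sample_pct + half_width, 1 - 1 / k**2


# Hypothetical numbers: 93% of a simple random sample of 400 from 10,000.
low, high, level = conservative_interval(0.93, n=400, k=2, N=10_000)
print(f"[{low:.3f}, {high:.3f}] at nominal level {level:.0%}")
```

Taking $k = 2$ gives a nominal confidence level of 75%. For a given $k$, the finite population correction makes the interval for a simple random sample narrower than the interval for sampling with replacement.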

Suppose that the random sample is drawn with replacement. When the sample size $n$ is large, the central limit theorem ensures that the probability histogram of the sample percentage $\hat{p}$ can be approximated accurately by the normal curve. The expected value of the sample percentage is $p$ and the SE of the sample percentage is $\mathrm{SD(box)}/\sqrt{n}$, where SD(box) is the population SD, $\sqrt{p(1-p)}$, the SD of the list of numbers on the tickets in the box. When $n$ is large, the SD of the sample, $s^*$, tends to be an accurate estimate of SD(box), and the chance that the random interval

$$\left[\hat{p} - z s^*/\sqrt{n},\ \hat{p} + z s^*/\sqrt{n}\right]$$

contains $p$ is approximately equal to the area under the normal curve between $\pm z$. Taking $z = 1.96$, for example, gives approximate 95% confidence intervals. The coverage probability of this procedure typically is not exactly the area under the normal curve between $\pm z$, but as the sample size grows, the coverage probability approaches that area.
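The calculation can be sketched as follows for 0-1 data. The function name, the sample, and the use of `statistics.NormalDist` to find $z$ from the desired area are assumptions made for illustration; for a list of 0s and 1s, the SD of the sample equals the square root of the sample percentage times one minus the sample percentage.

```python
from math import sqrt
from statistics import NormalDist


def approx_percentage_interval(data, level=0.95):
    """Approximate confidence interval for the fraction of 1s in a 0-1 box.

    data: list of 0s and 1s drawn at random with replacement.
    level: desired area under the normal curve between -z and +z.
    """
    n = len(data)
    p_hat = sum(data) / n                       # sample percentage
    s_star = sqrt(p_hat * (1 - p_hat))          # SD of the sample for 0-1 data
    z = NormalDist().inv_cdf((1 + level) / 2)   # about 1.96 when level is 0.95
    half_width = z * s_star / sqrt(n)
    return p_hat - half_width, p_hat + half_width


sample = [1] * 120 + [0] * 280  # hypothetical sample: 120 ones in 400 draws
low, high = approx_percentage_interval(sample)
print(f"approximate 95% confidence interval: [{low:.4f}, {high:.4f}]")
```

Because the interval relies on the normal approximation, its stated confidence level is only approximate, and the approximation can be poor when the sample is small or the sample percentage is close to 0% or 100%.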

Approximate confidence intervals for the population mean can be constructed similarly, but then it is more common to use

$$s = s^* \sqrt{n/(n-1)}$$

to estimate SD(box) than to use $s^*$. Let $M$ denote the sample mean. For random sampling with replacement, if the sample size $n$ is large, the chance that the random interval

$$\left[M - z s/\sqrt{n},\ M + z s/\sqrt{n}\right]$$

covers the population mean is approximately equal to the area under the normal curve between $\pm z$. Again, the coverage probability is not exactly the area under the normal curve between $\pm z$, but it approaches that area as the sample size grows.
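A short sketch of the arithmetic, using Python's standard `statistics` module: `pstdev` divides by $n$ and so plays the role of $s^*$, while `stdev` divides by $n - 1$ and so plays the role of $s$. The data and the function name are hypothetical, and the sample is far too small for the normal approximation to be trusted; it serves only to show the steps.

```python
from math import sqrt
from statistics import mean, pstdev, stdev


def approx_mean_interval(data, z=1.96):
    """Approximate confidence interval [M - z*s/sqrt(n), M + z*s/sqrt(n)]."""
    n = len(data)
    m = mean(data)                 # sample mean M
    s = stdev(data)                # sample standard deviation s (divides by n - 1)
    half_width = z * s / sqrt(n)
    return m - half_width, m + half_width


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(data)
s_star = pstdev(data)              # SD of the sample, s* (divides by n)
s = stdev(data)

# The two estimates of SD(box) differ by the factor sqrt(n / (n - 1)).
print(s_star, s, s_star * sqrt(n / (n - 1)))
print(approx_mean_interval(data))
```

The factor $\sqrt{n/(n-1)}$ approaches 1 as $n$ grows, so for large samples the choice between $s$ and $s^*$ makes little difference to the interval.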

Confidence intervals can be constructed for population parameters other than percentages and means. For example, one can construct confidence intervals for percentiles of a population using the fact that for random sampling with replacement, the number of data that are less than the $100q$th percentile has a binomial distribution with parameters $n$ and $p$ at most $q$, and the number of data that are greater than the $100q$th percentile has a binomial distribution with parameters $n$ and $p$ at most $1 - q$.

Key Terms

approximate confidence interval
bootstrap estimate of the standard deviation
Chebychev's inequality
confidence interval
confidence level
conservative confidence interval
coverage probability
expected value
finite population correction f
normal approximation
normal curve
parameter
population mean
population percentage
population SD
probability
sample mean
sample percentage
sample standard deviation s
standard deviation (SD)
standard deviation of the sample s*

1997-2015. P.B. Stark. All rights reserved. Last generated 5/29/2015, 8:04:49 PM. Content last modified 21 January 2013 08:37 PST.
