Chapter 26
Confidence Intervals
This chapter continues our study of estimating population parameters from random samples. In Chapter 25, Estimating Parameters from Simple Random Samples, we studied estimators that assign a number to each possible random sample, and the uncertainty of such estimators, measured by their RMSE. (The RMSE is the square root of the expected value of the squared difference between the estimator and the parameter: a measure of the typical size of the error.) Instead of assigning a single number to each sample and reporting the size of a typical error, the methods in this chapter assign an interval to each sample and report the confidence level that the interval contains the parameter. Confidence is a technical term related to probability. The RMSE of an estimator measures the long-run average size of the error in repeated sampling, but the error for any particular sample could be smaller or larger than the RMSE. In the same way, the confidence level is the long-run fraction of intervals that contain the parameter in repeated sampling, but the interval for any particular sample might or might not contain the parameter.
The statement "the interval [92%, 94%] contains the population percentage at confidence level 90%" does not mean that the probability that the population percentage is between 92% and 94% is 90%. (The event that the interval [92%, 94%] contains the population percentage is not random: Either the population percentage is between 92% and 94%, or it is not.) Rather, the statement means that if we were to take samples of size n repeatedly and compute a 90% confidence interval for the population percentage from each sample, the long-run fraction of intervals that contain the population percentage would converge to 90%.
The length of the confidence interval and the confidence level measure how accurately we are able to estimate the parameter from a sample. If a short interval has high confidence, the data allow us to estimate the parameter accurately. Higher confidence generally requires a longer interval, ceteris paribus, and shorter intervals generally have lower confidence levels. Conventional values for the confidence level of confidence intervals include 68%, 90%, 95%, and 99%, but sometimes other values are used. It is crucial to know the confidence level associated with a confidence interval: The interval by itself is meaningless.
Conservative confidence intervals for percentages
In this section, we develop conservative confidence intervals for the population percentage based on the sample percentage, using Chebychev's inequality and an upper bound on the SD of lists that contain only the numbers 0 and 1. Conservative means that the chance that the procedure produces an interval that contains the population percentage is at least as large as claimed. (Later in this chapter we will consider approximate confidence intervals.)
Consider a 0-1 box of N tickets. The population percentage p is the fraction of tickets labeled "1":
$$p = 100\% \times \frac{\#\{\text{tickets in the population labeled "1"}\}}{N}.$$
The population percentage is also the population mean of the numbers on all the tickets in the box, ave(box). The sample percentage of a simple random sample (random sample without replacement) of size n from the population of N tickets is
$$\hat{p} = 100\% \times \frac{\#\{\text{tickets in the sample labeled "1"}\}}{n}.$$
The sample percentage is the sample mean of the labels on the tickets in the sample. The expected value of the sample percentage is the population percentage p, and the SE of the sample percentage is
$$SE(\hat{p}) = f \times \frac{\sqrt{p(1-p)}}{\sqrt{n}} \le f \times \frac{50\%}{\sqrt{n}},$$
where f is the finite population correction
$$f = \sqrt{\frac{N-n}{N-1}}.$$
Thus $f \times 50\%/\sqrt{n}$ is an upper bound on the SE of the sample percentage.
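The bound can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and do not come from the text, and p is expressed as a fraction between 0 and 1 rather than as a percentage.

```python
import math

def finite_population_correction(N, n):
    """f = sqrt((N - n) / (N - 1)) for a simple random sample of size n from N tickets."""
    return math.sqrt((N - n) / (N - 1))

def se_sample_percentage(p, N, n):
    """SE of the sample percentage when the population fraction of ones is p."""
    f = finite_population_correction(N, n)
    return f * math.sqrt(p * (1 - p)) / math.sqrt(n)

def se_upper_bound(N, n):
    """Upper bound on the SE: sqrt(p(1-p)) is replaced by its largest value, 1/2."""
    return finite_population_correction(N, n) * 0.5 / math.sqrt(n)

if __name__ == "__main__":
    # Hypothetical population of 1000 tickets and a sample of 25.
    for p in (0.1, 0.3, 0.5, 0.9):
        print(p, se_sample_percentage(p, 1000, 25), se_upper_bound(1000, 25))
```

The true SE equals the bound only when p is 50%; for every other p it is smaller.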
Figure 26-1 shows what happens if we center an interval at the sample percentage, and extend the interval down and up from the sample percentage by twice the upper bound on the SE of the sample percentage. When the interval includes the population percentage, we say the interval covers the truth. The interval is random, because it is centered at the sample percentage, which is random. The chance that the random interval will contain the true population percentage is called the coverage probability of the interval. Take a few samples by clicking Take Sample to get the feel of the tool; then increase Samples to Take to 1000 and click Take Sample again. The actual percentage of intervals that cover will vary, but almost always it will be larger than 75%, sometimes nearly 100%. The empirical percentage of intervals that cover is an estimate of the coverage probability of the procedure. Vary the sample size and put a few different lists of zeros and ones into the Population box at the right of the figure, and try a few different sample sizes for each population. You should find that the fraction of intervals that cover the true population percentage stays above 75% (almost without fail), no matter what the population of zeros and ones is.
Figure 26-1: Conservative Confidence Interval for the Population Percentage
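Why do these random intervals cover the true population percentage so often? We can show that they should using Chebychev's inequality. Because
$$SE(\hat{p}) \le f \times 50\%/\sqrt{n},$$
the event
$$|\hat{p} - p| \le k \times SE(\hat{p})$$
is a subset of the event
$$|\hat{p} - p| \le k \times f \times 50\%/\sqrt{n}.$$
It follows that
$$P(|\hat{p} - p| \le k \times SE(\hat{p})) \le P(|\hat{p} - p| \le k \times f \times 50\%/\sqrt{n}).$$
Chebychev's inequality guarantees that the chance the sample percentage differs from its expected value p by more than k times its standard error is at most $1/k^2$, so
$$1 - 1/k^2 \le P(|\hat{p} - p| \le k \times SE(\hat{p}))$$
$$\le P(|\hat{p} - p| \le k \times f \times 50\%/\sqrt{n}).$$
That is,
$$P(|\hat{p} - p| \le k \times f \times 50\%/\sqrt{n}) \ge 1 - 1/k^2.$$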
Therefore, in the long run in repeated sampling, the fraction of trials in which the sample percentage $\hat{p}$ is within $2f \times 50\%/\sqrt{n}$ of the population percentage p converges to a number that is 75% or larger. Whenever $\hat{p}$ is within $2f \times 50\%/\sqrt{n}$ of the population percentage p, an interval centered at $\hat{p}$ extending down and up by $2f \times 50\%/\sqrt{n}$ will contain p. That is, the interval
$$\hat{p} \pm 2f \times 50\%/\sqrt{n},$$
which is shorthand for
$$[\hat{p} - 2f \times 50\%/\sqrt{n},\ \hat{p} + 2f \times 50\%/\sqrt{n}],$$
contains p at least 75% of the time, in the long run. Similarly, the fraction of trials in which $\hat{p}$ is within $3f \times 50\%/\sqrt{n}$ of p converges to a number that is 88.89% or larger, so the long-run fraction of intervals $\hat{p} \pm 3f \times 50\%/\sqrt{n}$ that contain p will be 88.89% or larger. The fraction of trials in which $\hat{p}$ is within $4f \times 50\%/\sqrt{n}$ of p converges to a number that is 93.75% or larger, so the long-run fraction of intervals $\hat{p} \pm 4f \times 50\%/\sqrt{n}$ that contain p will be 93.75% or larger, etc.
In general, if we go down and up from the sample percentage by $k \times f \times 50\%/\sqrt{n}$, then in the long run in repeated trials, the resulting intervals will include the true population percentage at least $1 - 1/k^2$ of the time.
Change the "Intervals:" value in Figure 26-1 to 3 and to 4 to confirm empirically that this is true.
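For readers without access to the interactive figure, the same experiment can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the box contents, the sample size, the number of trials, and the function names are hypothetical choices, not values from the text.

```python
import math
import random

def conservative_interval(sample, N, k):
    """Interval p_hat +/- k * f * 0.5 / sqrt(n), with fractions in [0, 1]."""
    n = len(sample)
    p_hat = sum(sample) / n
    f = math.sqrt((N - n) / (N - 1))      # finite population correction
    half_width = k * f * 0.5 / math.sqrt(n)
    return p_hat - half_width, p_hat + half_width

def coverage_fraction(box, n, k, trials=2000, seed=0):
    """Empirical fraction of intervals that cover the population fraction of ones."""
    rng = random.Random(seed)
    p = sum(box) / len(box)
    covered = 0
    for _ in range(trials):
        sample = rng.sample(box, n)       # simple random sample (no replacement)
        lower, upper = conservative_interval(sample, len(box), k)
        covered += lower <= p <= upper
    return covered / trials

if __name__ == "__main__":
    box = [1] * 30 + [0] * 70             # hypothetical 0-1 box with p = 30%
    for k in (2, 3, 4):
        print(k, coverage_fraction(box, 10, k), "nominal:", 1 - 1 / k**2)
```

The empirical coverage should come out at or above the nominal level $1 - 1/k^2$ for each k, usually well above it.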
The interval $\hat{p} \pm k \times f \times 50\%/\sqrt{n}$ is random: Its center depends on $\hat{p}$, which in turn depends on which units (here, tickets) happen to be in the random sample. The probability is in the random sampling procedure, not in the parameter. The parameter is the same no matter what sample we happen to get: the parameter is a property of the population, not the sample. It is the interval that varies with the random sample. Before the data are collected, the coverage probability is the chance that sampling will result in an interval that contains the parameter.
Taking the sample determines the interval, leaving nothing to chance: The interval the procedure produced either does or does not contain the population percentage. (One could say that after collecting the data, the chance that the interval covers the parameter is either 0 or 100%.) Typically, we never learn whether the interval covers the parameter, but our ignorance is not a probability (at least, not according to the frequency theory of probability used in this book).
The interval the procedure gives for any particular set of data is called a confidence interval. The confidence level of a confidence interval is equal to the coverage probability of the procedure before the data are collected.
Confidence is a word statisticians reserve for this idea. If, before collecting the data, the procedure we are using has a P% chance of producing an interval that covers the true population percentage, then, after collecting the data, the interval the procedure produced is called a P% confidence interval.
Coverage Probability and Confidence Level
Consider a population parameter, and a procedure that produces random intervals. Suppose that the probability that the procedure produces an interval that contains the parameter is P%.
1. The procedure is said to have coverage probability P%.
2. The interval the procedure produces for any particular sample is called a P% confidence interval for the parameter, or a confidence interval for the parameter with confidence level P%.
In repeated sampling, about P% of confidence intervals with confidence level P% will contain (cover) the parameter. About (100−P)% of the intervals will not cover the parameter. For any particular sample, unless the population parameter is known, we will not know whether the confidence interval covers the parameter.
Chapter 25, Estimating Parameters from Simple Random Samples, summarized the uncertainty of an estimate of a parameter by the mean squared error or root mean squared error of the estimator, which are measures of the average error of the estimator in repeated sampling. A confidence interval is a different way of expressing the uncertainty in an estimate: a range of values that contains the parameter with specified confidence level.
The interpretation of confidence level for a particular interval is analogous to the interpretation of RMSE for a particular value of the estimate: The RMSE is the square root of the long-run average squared error of the estimator in repeated sampling, but for any particular sample, the error could be larger or smaller than the RMSE, and we will not know which unless we know the true value of the parameter. The confidence level measures the long-run fraction of intervals that contain the parameter in repeated sampling, but for any particular sample, the confidence interval either will or will not contain the parameter, and we will not know which unless we know the true value of the parameter.
We can use the approach developed in this section to construct confidence intervals for the population percentage p with other nominal confidence levels, by extending the interval up and down from the sample percentage by larger or smaller amounts. The longer the intervals, the larger the nominal confidence level: the larger the chance that an interval will contain p. The shorter the intervals, the smaller the chance that an interval will contain p. In particular, if we choose k so that
$$1 - 1/k^2 = P\%,$$
then the interval
$$[\hat{p} - k \times f \times 50\%/\sqrt{n},\ \hat{p} + k \times f \times 50\%/\sqrt{n}]$$
is a (nominal) P% confidence interval for the population percentage p.
Conversely, to get a nominal P% conservative confidence interval for the population percentage using a simple random sample, we should take an interval that extends down and up from the sample percentage by $k \times f \times 50\%/\sqrt{n}$, with
$$k = (1 - P/100)^{-1/2}.$$
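As a quick check of the relationship between P and k, here is a small Python sketch. The function name is hypothetical; it simply solves $1 - 1/k^2 = P/100$ for k.

```python
import math

def chebyshev_multiplier(conf_percent):
    """Return k such that 1 - 1/k**2 equals conf_percent / 100."""
    if not 0 <= conf_percent < 100:
        raise ValueError("confidence level must be in [0, 100)")
    return 1 / math.sqrt(1 - conf_percent / 100)

if __name__ == "__main__":
    for level in (75, 90, 95, 99):
        print(level, round(chebyshev_multiplier(level), 3))
```

A 75% nominal level gives k = 2, matching the intervals discussed above; higher levels require rapidly growing multipliers because the bound comes from Chebychev's inequality rather than from the shape of the distribution.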
The actual coverage probability of the interval
$$[\hat{p} - k \times f \times 50\%/\sqrt{n},\ \hat{p} + k \times f \times 50\%/\sqrt{n}]$$
is greater than $(1 - 1/k^2)$, for two reasons. First, the standard error of the sample percentage is less than $f \times 50\%/\sqrt{n}$ unless the population percentage p is 50%. Second, the distribution of the sample percentage is that of a hypergeometric random variable divided by the sample size, n, and such a distribution cannot attain the bound in Chebychev's inequality: Even for the true SE of the sample percentage,
$$SE(\hat{p}) = f \times \sqrt{p(1-p)}/\sqrt{n},$$
the chance that the sample percentage is within $k \times SE(\hat{p})$ of the population percentage p is greater than $1 - 1/k^2$:
$$P(|\hat{p} - p| \le k \times SE(\hat{p})) > 1 - 1/k^2.$$
As a result, confidence intervals for the population percentage based on Chebychev's inequality and the upper bound of 50% for the SD of a list of zeros and ones are conservative: the actual confidence level is greater than the nominal confidence level, $(1 - 1/k^2)$. The next section develops a procedure that is not conservative, but that is approximate: The confidence level could be larger or smaller than the nominal level. (The nominal confidence level is close to the actual confidence level when the sample size n is large.)
A population percentage cannot be less than 0%. If the lower endpoint of a confidence interval for a population percentage is negative, it is completely legitimate to replace the lower endpoint by zero: It does not decrease the confidence level. Similarly, a population percentage cannot be greater than 100%. If the upper endpoint of a confidence interval for a population percentage is greater than 100%, it is legitimate to replace the upper endpoint by 100%. The confidence level remains the same. Similarly, if we are constructing a confidence interval for a quantity that cannot be negative (height, weight, or age, for instance), removing negative values from a confidence interval cannot reduce the coverage probability or confidence level.
Confidence Intervals for Restricted Parameters
If some values of a parameter are known to be impossible, excluding those values from a confidence interval does not reduce the confidence level of the confidence interval.
Conversely, including impossible values of a parameter in a confidence interval does not increase the confidence level.
For example, if a confidence interval for a parameter that must be positive has a lower endpoint that is negative, the lower endpoint can be replaced with zero. The confidence level remains the same.
In particular, if the lower endpoint of a confidence interval for a population percentage is negative, the lower endpoint can be replaced with zero. If the upper endpoint of a confidence interval for a population percentage is greater than 100%, the upper endpoint can be replaced with 100%.
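In code, restricting an interval to the possible values of the parameter is a one-line truncation. The helper below is a hypothetical illustration in Python, with endpoints expressed in percent.

```python
def clamp_interval(lower, upper, min_value=0.0, max_value=100.0):
    """Truncate a confidence interval to the range of possible parameter values."""
    return max(lower, min_value), min(upper, max_value)

if __name__ == "__main__":
    print(clamp_interval(-4.0, 27.5))    # lower endpoint replaced by 0
    print(clamp_interval(88.0, 104.0))   # upper endpoint replaced by 100
```

The truncated interval contains the parameter exactly when the original interval does, which is why the confidence level is unchanged.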
Whenever you use a confidence interval, it is crucial to report the confidence level. Otherwise, it is impossible to interpret the result. The choice of the confidence level is essentially arbitrary, but the choice should be made before collecting the data. Common values of the confidence level are 68%, 90%, 95%, and 99%. There is a tradeoff between precision (the length of the confidence interval) and confidence level: Ceteris paribus, higher confidence levels require longer confidence intervals.
The following exercise checks your ability to compute a conservative confidence interval for the population percentage.
Exercise 26-1. The entering class at North Southcentral University contains 600 students. The dean's office seeks to determine the percentage of entering students who have credit cards. The dean's office will take a simple random sample of 40 entering students, interview them, and compute the sample percentage. The office would like to construct a conservative 75% confidence interval for the percentage of entering students who have credit cards. The center of the interval will be the sample percentage.
The interval should extend up and down from the sample percentage by ____.
The sample is taken, and the sample percentage is observed to be 86%.
The lower endpoint of the confidence interval should be ____ and the upper endpoint should be ____.
The probability that this interval contains the percentage of students in the entering class who have credit cards: ____
The confidence level of this interval: ____
Conservative confidence intervals for population means of bounded boxes
Recall that percentages are just means of special lists of numbers, lists that contain only zeros and ones. We can find confidence intervals for the means of more general lists of numbers, too.
In the previous section we exploited the fact that the SD of a 0-1 box is at most 1/2 to construct a conservative confidence interval for the population mean of a 0-1 box, that is, the population percentage. The approach can be used not only for 0-1 boxes, but whenever we can find a bound on the SD of the box, so that we can apply Chebychev's inequality. For any box of numbered tickets whatsoever, the sample mean of a simple random sample or random sample with replacement is an unbiased estimator of the population mean of the numbers on the tickets, and the SE of the sample mean is proportional to the SD of the box.
For instance, suppose we know that the numbers on the tickets in the box are all between a and b, with $a \le b$. Then SD(box) is at most $(b-a)/2$. In the special case that a = 0 and b = 1, this implies that the SD of a 0-1 box is at most 50%, as we have seen already.
That in turn implies that if all the numbers in a box are between a and b, the SE of the sample mean of a simple random sample of n draws from the box is at most $f \times (b-a)/(2\sqrt{n})$, where f is the finite population correction. And the SE of the sample mean of n draws with replacement from the box is at most $(b-a)/(2\sqrt{n})$.
Sampling from a Bounded Box
Suppose all the numbers in a box are between a and b, with $a \le b$. Then:
1. SD(box) is at most $(b-a)/2$.
2. The SE of the sample mean of n draws with replacement from the box is at most $(b-a)/(2\sqrt{n})$.
3. The SE of the sample mean of a simple random sample of size n from the box is at most $f \times (b-a)/(2\sqrt{n})$, where f is the finite population correction.
With a bound on the SE, we can use Chebychev's inequality the same way we did for the population percentage to get a confidence interval for the population mean of the numbers on the tickets in a box:
Conservative Confidence Intervals for the Population Mean of a Bounded List
Suppose all the numbers in a box are between a and b, where $a \le b$.
For a simple random sample of size n, the chance that the random interval
$$[(\text{sample mean}) - k \times f \times (b-a)/(2\sqrt{n}),\ (\text{sample mean}) + k \times f \times (b-a)/(2\sqrt{n})]$$
includes the mean of the numbers in the box is at least $1 - 1/k^2$, where f is the finite population correction $\sqrt{(N-n)/(N-1)}$, N is the population size, and n is the sample size.
For random sampling with replacement, the chance that the random interval
$$[(\text{sample mean}) - k \times (b-a)/(2\sqrt{n}),\ (\text{sample mean}) + k \times (b-a)/(2\sqrt{n})]$$
includes the mean of the numbers in the box is at least $1 - 1/k^2$.
In both cases, if the lower endpoint of the interval is less than a, it can be replaced by a, and if the upper endpoint of the interval is greater than b, it can be replaced by b.
These are conservative procedures for constructing confidence intervals: the probability that the intervals they produce cover the true population mean is greater than the probability they claim, $1 - 1/k^2$ (the nominal coverage probability).
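The recipe in the box above translates directly into a short function. The Python sketch below is illustrative only: the function name, the optional argument N (omit it for sampling with replacement), and the sample data are assumptions made for the example.

```python
import math

def conservative_mean_interval(sample, a, b, k, N=None):
    """Chebychev-based interval for the mean of a box whose numbers lie in [a, b].

    Pass the population size N for a simple random sample; leave N as None
    for sampling with replacement (no finite population correction).
    The nominal confidence level is 1 - 1/k**2.
    """
    n = len(sample)
    mean = sum(sample) / n
    f = 1.0 if N is None else math.sqrt((N - n) / (N - 1))
    half_width = k * f * (b - a) / (2 * math.sqrt(n))
    # Endpoints outside [a, b] can be replaced by a or b without losing confidence.
    return max(mean - half_width, a), min(mean + half_width, b)

if __name__ == "__main__":
    scores = [2, 4, 4, 6]                 # hypothetical draws from a box of scores in [0, 10]
    print(conservative_mean_interval(scores, 0.0, 10.0, k=2))
```

With only four draws the interval is very wide, which illustrates how conservative the bound $(b-a)/2$ on SD(box) can be.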
Approximate confidence intervals for percentages
Confidence intervals for the population percentage based on Chebychev's inequality and the upper bound of 50% for the SD of lists of zeros and ones are conservative: Their true confidence level is greater than their nominal confidence level, $(1 - 1/k^2)$. We could use shorter intervals and still have confidence level $(1 - 1/k^2)$, or we could claim a confidence level higher than $(1 - 1/k^2)$.
How much shorter could the interval be, or how large a confidence level could we claim? It is possible to figure these things out precisely, but we shall follow a standard approximate approach instead, one that we can extend to other situations. We shall use the central limit theorem to develop a procedure that produces shorter confidence intervals for a given nominal confidence level. The new procedure will be approximate instead of conservative: the coverage probability will be close to the nominal coverage probability when the sample size is large, but could be smaller or larger depending on the population percentage, and could be quite different from the nominal coverage probability for small samples from pathological populations.
We shall assume throughout the rest of this chapter that either the sample is drawn with replacement, or the sample size n is much, much smaller than the population size N.
With this assumption, we can neglect the finite population correction and act as if the tickets in the sample were drawn independently. (See Chapter 22, Standard Error.) When the tickets are drawn independently, the central limit theorem tells us that as the sample size grows, the normal curve is a better and better approximation to the probability histogram of the sample percentage (and to the probability histogram of the sample mean). The normal approximation to the probability that the sample percentage is in the interval
$$[p - 1.15\sqrt{p(1-p)}/\sqrt{n},\ p + 1.15\sqrt{p(1-p)}/\sqrt{n}]$$
is equal to the area under the normal curve for the corresponding range of values in standard units, [−1.15, 1.15]. The area under the normal curve between −1.15 and 1.15 is about 75%.
(Normal curve figure: selected area 74.99%; lower endpoint −1.15, upper endpoint 1.15.)
This is much larger than the bound of $(1 - 1/(1.15)^2) = 24.4\%$ that Chebychev's inequality gives. When the sample percentage is within
$$1.15\sqrt{p(1-p)}/\sqrt{n}$$
of p, p is within
$$1.15\sqrt{p(1-p)}/\sqrt{n}$$
of the sample percentage, so the probability that the interval
$$I = [\hat{p} - 1.15\sqrt{p(1-p)}/\sqrt{n},\ \hat{p} + 1.15\sqrt{p(1-p)}/\sqrt{n}]$$
contains the population percentage p is about 75%: The coverage probability of I is approximately 75%.
Unfortunately, we cannot construct I from the sample alone: the sample determines the center of I, but to find the length of I we need to know $p(1-p)$, which is tantamount to knowing p. If we knew p, we would not be estimating it.
If the sample size n is large, the sample standard deviation s,
$$s = \sqrt{\frac{n}{n-1}\,\hat{p}(1-\hat{p})},$$
is likely to be close to the SD of the population. When that happens,
$$s/\sqrt{n}$$
is close to $SE(\hat{p})$, the standard error of the sample percentage. Therefore, if the sample size is large, and either the sample is small compared to the population or the sample is taken with replacement, the probability that the random interval
$$[\hat{p} - 1.15\,s/\sqrt{n},\ \hat{p} + 1.15\,s/\sqrt{n}]$$
contains the population percentage p is about 75%. This interval has not only a random center (the sample percentage), but also a random length (the length depends on the observed value of s, and s is random, because it depends on the random sample).
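The approximate interval can be computed for any nominal confidence level by replacing 1.15 with the matching normal quantile. The Python sketch below uses `statistics.NormalDist` from the standard library; the function name and the example counts are hypothetical, and fractions are used in place of percentages.

```python
import math
from statistics import NormalDist

def approximate_percentage_interval(ones, n, confidence):
    """Normal-approximation interval p_hat +/- z * s / sqrt(n).

    ones: number of tickets labeled "1" in the sample
    n: sample size (drawn with replacement, or n much smaller than N)
    confidence: nominal level as a fraction, e.g. 0.75
    """
    p_hat = ones / n
    s = math.sqrt(n / (n - 1) * p_hat * (1 - p_hat))   # sample SD of a 0-1 sample
    z = NormalDist().inv_cdf(0.5 + confidence / 2)      # about 1.15 for 75%
    half_width = z * s / math.sqrt(n)
    return p_hat - half_width, p_hat + half_width

if __name__ == "__main__":
    # Hypothetical sample: 120 ones among 400 draws.
    print(approximate_percentage_interval(120, 400, 0.75))
    # Area under the normal curve between -1.15 and 1.15, for comparison with the text.
    nd = NormalDist()
    print(nd.cdf(1.15) - nd.cdf(-1.15))
```

Unlike the Chebychev-based interval, the coverage of this interval is only approximately equal to the nominal level, and the approximation can be poor for small n or for p near 0% or 100%.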
Figure 26-2 lets you try the procedure yourself. Each time you click the Take Sample button, a sample is drawn with replacement from the numbers in the box on the right (initially set to a random list of zeros and ones). The sample size initially is set to 30. The controls at the bottom of the figure allow you to change the size of each sample, the number of samples that are taken each time you click the button, and the width of the interval, as a multiple of the estimated SE or the conservative bound on the SE. (The estimated SE is $s/\sqrt{n}$ because we are sampling with replacement; the bound is $0.5/\sqrt{n}$.) A label in the bottom right corner reports the fraction of intervals that cover the population percentage. Intervals that cover are green; those that do not cover are red. A small black dot marks the middle of each interval (the sample percentage). A blue vertical line marks the true population percentage p.
Figure 26-2: Approximate confidence intervals for the population mean and percentage
(Interactive figure: sampling with replacement from a box containing 0, 0, 1, 0, 1; SD(Box) = 0.49, Ave(Box) = 0.4.)
Take a few samples to get the feel of the tool; then increase the Samples to take to 1000, and click the Take Sample button again. The actual percentage of intervals that cover will vary, but should be reasonably close to 75%. Increase Sample size to 200 and try again; the percentage of intervals that cover should be closer to 75%. Try putting a few different lists of zeros and ones into the Population box at the right of the figure, and try a few different sample sizes for each population. When the sample size is large, the fraction of intervals that cover the true population percentage will be very close to 75%.
The following exercises check your ability to compute conservative and approximate confidence intervals for the population percentage, and your ability to determine which method is more appropriate.
Exercise 26-2. I would like to know the fraction of UC Berkeley undergraduates who commute to school from their parents' homes. I send email to students with campus computer accounts until 100 have responded; 3 of the responders were commuters.
An approximate 95% confidence interval for the fraction of UC Berkeley undergraduates who commute to school is from ____ to ____.
Exercise 26-3. I would like to know the fraction of homes in Alameda County, California, that have assessed values of $700,000 or more. I take a simple random sample of size 500 from the Alameda County property tax records (somehow). The sample percentage of homes assessed at $700,000 or more is 10%.
An approximate 96% confidence interval for the percentage of homes assessed at $700,000 or more is from ____ to ____.
Exercise 26-4. A random sample with replacement of size 20 was taken from a box of tickets. Each ticket in the box is numbered either zero or one. Six of the tickets in the sample are labeled "1"; the rest are labeled "0."
The sample percentage is ____.
The sample size ____ large enough to justify assuming that s is close to SD(box) and using a confidence interval based on the normal distribution.
The SE of the sample percentage is ____.
A conservative 70% confidence interval for the population percentage is from ____ to ____.
Exercise 26-5. A restaurateur plans to change the menu in her restaurant, which specializes in game meats. She is trying to decide whether or not to offer venison goulash on the new menu. Each day for a month, she picks people at random as they come into the restaurant, and asks them whether they would order venison goulash if it were offered. On busy days, she picks more people; on quiet days, she picks fewer people. Suppose that in effect she has a simple random sample of 160 people who eat at her restaurant. Suppose further that the number of diners is much, much larger than the sample. In the sample, 118 say they would order venison goulash if it were offered.
The sample percentage of diners who say they would order venison goulash is ____.
The bootstrap estimate of the population standard deviation is ____.
A 98% confidence interval for the percentage of diners who would say they would order venison goulash among the population of people who eat at that restaurant would go from ____ to ____.
Approximate Confidence Intervals for the Population Mean
Suppose that we seek a confidence interval for the mean of a population (box) of numbers, based on a random sample from the population. The sample mean is an unbiased estimator of the population mean (E(sample mean) = ave(box)), so it is reasonable to center a confidence interval at the sample mean. How wide should we make an interval centered at the sample mean, for the interval to have a specified probability of covering the population mean?
If we knew the SD of the population or had an upper bound on the SD of the population, we could use Chebychev's inequality to construct a conservative confidence interval for the population mean, as we did earlier in the chapter: the standard error of the sample mean is
$$SE(\text{sample mean}) = SD(\text{box})/\sqrt{n},$$
where n is the sample size. So, for example, the coverage probability of the random interval
$$[(\text{sample mean}) - 2 \times SD(\text{box})/\sqrt{n},\ (\text{sample mean}) + 2 \times SD(\text{box})/\sqrt{n}]$$
is at least 75%.
Typically, however, the SD of the population is not known, so we cannot construct this interval. Moreover, typically we cannot use the conservative approach based on Chebychev's inequality, because there is no upper bound on the SD of a general list of numbers analogous to the upper bound of 50% for the SD of lists that contain only zeros and ones. (As we have seen, if all the numbers are bounded between a and b, with $a \le b$, then SD(box) $\le (b-a)/2$; but typically we do not know such lower and upper bounds a and b.)
However, the approximate approach to constructing confidence intervals, based on the normal curve, works if the sample size is sufficiently large. The central limit theorem tells us that the probability histogram of the average of n draws with replacement from a box follows the normal curve increasingly well as the number of draws n increases. We also know that the sample standard deviation s is increasingly likely to be an accurate estimate of the SD of the population as n increases. As a result, the probability that the sample mean is within $z \times s/\sqrt{n}$ of the population mean is approximately the same as the area under the normal curve between −z and z. For any fixed population (box), the approximation improves as the sample size n increases, for random sampling with replacement. Example 26-1 illustrates calculating an approximate confidence interval for the population mean. The example is dynamic: It will tend to change when you reload the page.
Example 26-1: Approximate Confidence Interval for the Population Mean
To assess scholastic performance, a state administers an achievement test to a simple random sample of 160 high school seniors. There are 40,000 high school seniors in the state. The mean score of the students who took the exam is 104.34 points, and the sample standard deviation of their scores is 13.9 points. Find an approximate 98% confidence interval for the average of the population scores that would have been obtained had every high school senior in the state been administered the achievement test.
Solution. The sample size (160) is a sufficiently small fraction of the population size (40,000) that treating the sample as if it were drawn with replacement is reasonable. The sample size is sufficiently large that the normal approximation to the distribution of the sample mean should be reasonably accurate, and that the sample standard deviation should be close to the standard deviation of the population. The area under the normal curve between −2.326 and 2.326 is 98%.
Thus, an approximate 98% confidence interval would be centered at the sample mean, and extend down and up from the sample mean by 2.326 times the estimated standard error of the sample mean. The estimated standard error of the sample mean is
$$13.9/\sqrt{160} = 1.099 \text{ points}.$$
The confidence interval thus should extend down and up from the sample mean by
$$2.326 \times 1.099 = 2.556 \text{ points},$$
so the confidence interval is
[101.784 points, 106.896 points].
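The arithmetic in Example 26-1 can be reproduced with the Python standard library. The sketch below uses the numbers from the example; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 160                 # sample size
sample_mean = 104.34    # points
sample_sd = 13.9        # points
confidence = 0.98

# Normal quantile that leaves (1 - confidence)/2 in each tail: about 2.326.
z = NormalDist().inv_cdf(0.5 + confidence / 2)
# Estimated SE of the sample mean, treating the draws as independent.
se_hat = sample_sd / math.sqrt(n)
half_width = z * se_hat
lower, upper = sample_mean - half_width, sample_mean + half_width

if __name__ == "__main__":
    print(round(z, 3), round(se_hat, 3), round(lower, 3), round(upper, 3))
```

Because the sample is a small fraction of the 40,000 seniors, the finite population correction is ignored here, as in the solution above.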
The following exercise checks your ability to calculate approximate confidence intervals for the population mean. The exercise is dynamic: The question will tend to change when you reload the page.
Exercise 26-6. To determine the average lifetime of their light-emitting diode (LED) light bulbs, a manufacturer takes a simple random sample of 110 bulbs from a manufacturing lot of 34,000 bulbs. The mean lifetime of the bulbs in the sample is 93.49 thousand hours, and the sample standard deviation of their lifetimes is 8.56 thousand hours.
An approximate 98% confidence interval for the average lifetime of the bulbs in the manufacturing lot would extend from ____ thousand hours (low) to ____ thousand hours (high).
Exact Confidence Intervals for Percentages
We have seen two methods for constructing confidence intervals for a population percentage: a conservative method based on Chebychev's inequality and a bound on SD(box), and an approximate method based on the normal approximation. Conservative means that the coverage probability is at least as high as claimed but could be substantially higher for some populations. Approximate means that the coverage probability is roughly as high as claimed but could be substantially lower (or substantially higher) for some populations. This section develops a third method, which is exact. Exact means that the probability that the random interval covers the true population percentage is just what it is claimed to be (depending on the value of p it can be a bit higher, simply because the binomial distribution is a discrete distribution).
These intervals are rather different from the confidence intervals presented earlier in this chapter, which were of the form (estimate ± uncertainty). Instead, each of the endpoints is computed from the data, separately. The resulting interval usually is not symmetric around the sample percentage.
We assume that a sample of size n is drawn at random with replacement from a 0-1 box. We want to find a confidence interval for p, the percentage of tickets labeled "1" in the box. Let X be the number of tickets in the sample that are labeled "1." If the true percentage of tickets labeled "1" in the 0-1 box is p, then X has a binomial probability distribution with parameters n and p. We will construct a confidence interval for p by looking at the values of p that are plausible, given the observed value of X. The approach is similar to the approach we took in Chapter 19, Probability Meets Data, and very closely related to hypothesis testing, discussed in Chapter 27, Hypothesis Testing: Does Chance Explain the Results?
Suppose the observed value of X is x. If p were very, very small (close to zero), it would be unlikely to see x or more ones in the sample unless x = 0. So seeing x ones in the sample is evidence that p is not too small. Conversely, if p were very, very large (close to one), it would be unlikely to see x or fewer ones in the sample unless x = n. So observing that X = x limits the plausible range of values of p.
Suppose we want a confidence interval for p with confidence level $1-\alpha$. Let $p^-$ be the smallest value of q for which
$$\alpha/2 \le P(X \ge x \text{ if } p = q) = {}_nC_x\, q^x (1-q)^{n-x} + {}_nC_{x+1}\, q^{x+1} (1-q)^{n-x-1} + \cdots + {}_nC_n\, q^n (1-q)^0.$$
Similarly, let $p^+$ be the largest value of q for which
$$\alpha/2 \le P(X \le x \text{ if } p = q) = {}_nC_x\, q^x (1-q)^{n-x} + {}_nC_{x-1}\, q^{x-1} (1-q)^{n-x+1} + \cdots + {}_nC_0\, q^0 (1-q)^n.$$
Then the interval $[p^-, p^+]$ is a $1-\alpha$ confidence interval for p. Intervals constructed this way can be much shorter than the conservative intervals based on Chebychev's inequality and the upper bound on SD(box), but they are still guaranteed to attain at least their nominal confidence level. Confidence intervals based on the normal approximation are generally not much shorter, but their actual confidence level can be substantially lower than their nominal confidence level.
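The two endpoints defined above have no closed form in general, but each binomial tail probability is monotone in q, so the endpoints can be found numerically. The Python sketch below uses bisection; the function names, the tolerance, and the choice of bisection are assumptions for illustration, not a method prescribed by the text.

```python
import math

def binom_tail_ge(n, x, q):
    """P(X >= x) for X ~ Binomial(n, q)."""
    return sum(math.comb(n, j) * q**j * (1 - q)**(n - j) for j in range(x, n + 1))

def binom_tail_le(n, x, q):
    """P(X <= x) for X ~ Binomial(n, q)."""
    return sum(math.comb(n, j) * q**j * (1 - q)**(n - j) for j in range(0, x + 1))

def exact_interval(n, x, alpha, tol=1e-10):
    """Return (p_minus, p_plus) as fractions, found by bisection on q."""
    # Lower endpoint: smallest q with P(X >= x | q) >= alpha/2.
    # That tail probability increases with q; for x = 0 it is 1 for every q.
    if x == 0:
        p_minus = 0.0
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binom_tail_ge(n, x, mid) >= alpha / 2:
                hi = mid
            else:
                lo = mid
        p_minus = hi
    # Upper endpoint: largest q with P(X <= x | q) >= alpha/2.
    # That tail probability decreases with q; for x = n it is 1 for every q.
    if x == n:
        p_plus = 1.0
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binom_tail_le(n, x, mid) >= alpha / 2:
                lo = mid
            else:
                hi = mid
        p_plus = lo
    return p_minus, p_plus

if __name__ == "__main__":
    # Hypothetical data: 5 ones in 10 draws, 95% confidence level.
    print(exact_interval(10, 5, 0.05))
```

When x = 0 the upper endpoint solves $(1-q)^n = \alpha/2$, which gives a closed-form check on the bisection.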
Confidence Intervals for Population Percentiles
We can also use a random sample with replacement to find a confidence interval for a percentile of a population. We shall work out the details for the median; other percentiles can be treated similarly. Unlike the conservative and approximate confidence intervals, and like the exact confidence intervals for the population percentage we just saw, these intervals are not of the form (estimate ± uncertainty). Instead, the endpoints of the intervals are two of the data. And this approach also leads to exact confidence intervals: The nominal coverage probability is equal to the actual coverage probability.
To begin, suppose we have a random sample of size 10
$$\{X_1, X_2, \ldots, X_{10}\}$$
taken with replacement from a population with median m. Sort the data into increasing order: let $X_{(1)}$ be the smallest datum, $X_{(2)}$ be the second smallest, etc., and let $X_{(10)}$ be the largest datum. (The sorted data are called the order statistics.) Let $A_1$ be the event that the fourth smallest datum, $X_{(4)}$, is less than or equal to the median, and let $A_2$ be the event that the seventh smallest datum, $X_{(7)}$, is greater than or equal to the median. The event $A_1$ occurs unless 7 or more data are greater than the population median, so $A_1^c$ is the event that 7 or more data are greater than the population median. Similarly, the event $A_2$ occurs unless 7 or more data are less than the population median, so $A_2^c$ is the event that 7 or more data are less than the population median. Let $A = A_1 \cap A_2$ be the event that the fourth and seventh order statistics bracket the median. We shall find a lower bound on the probability of A.
Note that if seven or more data are less than the median, then it is not the case that seven or more data are greater than the median, so $A_1^c$ and $A_2^c$ are disjoint. Hence,
$$P(A^c) = P((A_1 \cap A_2)^c)$$
$$= P(A_1^c \cup A_2^c)$$
$$= P(A_1^c) + P(A_2^c),$$
and thus
$$P(A) = 1 - P(A^c) = 1 - P(A_1^c) - P(A_2^c).$$
We are done if we can find upper bounds for $P(A_1^c)$ and $P(A_2^c)$.
Recall that the median is the smallest number that at least 50% of the population are less than or equal to. It follows that the probability that a number drawn at random from the population is strictly less than the median is at most 50% (and possibly less), and that the probability that a number drawn at random from the population is strictly greater than the median is at most 50% (and possibly less). The data are drawn from the population independently, so the number of data that are less than the population median has a binomial probability distribution with n trials and $p \le 50\%$, as does the number of data that are greater than the population median.
Let Y be a random variable with a binomial distribution with parameters n = 10 and p = 50%. Thus $P(A_1^c) \le P(Y \ge 7)$, and $P(A_2^c) \le P(Y \ge 7)$. However, $P(Y \ge 7) = P(Y \le 3)$, so
$$P(A) \ge 1 - P(Y \le 3 \text{ or } Y \ge 7) = P(4 \le Y \le 6).$$
Thus the probability that the interval $[X_{(4)}, X_{(7)}]$ contains the population median is at least as large as the probability of observing 4, 5, or 6 successes in 10 independent trials with probability 50% of success in each trial (the highlighted area in Figure 26-3).
Figure 26-3: Binomial probability histogram
The interval from the fourth smallest datum to the seventh smallest datum is therefore a 65.6% confidence interval for the population median.
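The 65.6% figure can be verified, and the same bound evaluated for other order statistics, with a short Python sketch. The function names are hypothetical; the formula is the bound $1 - P(A_1^c) - P(A_2^c)$ from the argument above, generalized under the assumption that the interval runs from the i-th to the j-th smallest datum with $i \le j$.

```python
import math

def binom_half_upper_tail(n, k):
    """P(Y >= k) for Y ~ Binomial(n, 1/2)."""
    return sum(math.comb(n, y) for y in range(k, n + 1)) / 2**n

def median_interval_confidence(n, i, j):
    """Lower bound on the chance that [X_(i), X_(j)] contains the population median.

    X_(i) exceeds the median only if at least n - i + 1 data are greater than it;
    X_(j) falls below the median only if at least j data are less than it.
    """
    if not 1 <= i <= j <= n:
        raise ValueError("need 1 <= i <= j <= n")
    return 1 - binom_half_upper_tail(n, n - i + 1) - binom_half_upper_tail(n, j)

if __name__ == "__main__":
    # Sample of size 10, fourth to seventh order statistic, as in the text.
    print(median_interval_confidence(10, 4, 7))
```

For n = 10, i = 4, j = 7 this evaluates to 672/1024, which is the 65.6% quoted above.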
Thesameideacanbeusedtofindconfidenceintervalsforotherpercentiles:Theprobabilitydistributionofthenumberofdatathatarelessthanthe100qthpercentileisBinomialwithnumberoftrialsequaltothenumberofdata,n,andprobabilityofsuccessatmostq,andtheprobabilitydistributionofthenumberofdatathataregreaterthanthe100qthpercentileisBinomialwithnumberoftrialsequaltothenumberofdata,n,andprobabilityofsuccessatmost1q.
The following exercise checks whether you can find a confidence interval for a population median.
Exercise 26-7. Consider finding a 96.5% confidence interval for the median of a population from a random sample with replacement of size 15.
The confidence interval should go from the ? datum to the ?
Summary
Suppose we have a procedure for calculating an interval from every possible sample of size $n$ from a population of size $N$ (a box of $N$ numbered tickets). Let $t$ be a parameter of the population. Suppose that if the procedure is applied to a random sample of size $n$, the chance that the resulting interval will contain $t$
is $P\%$. Then the interval that results from applying the procedure to any particular random sample of size $n$ is a $P\%$ CONFIDENCE INTERVAL FOR $t$. Once the random sample has been drawn, the resulting interval either covers (contains) or does not cover $t$: the probability that the interval covers $t$ is either 0 or 100%. The probability that the interval will cover $t$ before the sample is drawn is called the CONFIDENCE LEVEL of the interval after the sample is drawn. Confidence intervals provide an alternative to reporting a single "best estimate" of a parameter and a summary measure of the uncertainty of the estimate.
It is possible to construct conservative confidence intervals for the population percentage from simple random samples or random samples with replacement from 0-1 BOXES: For a simple random sample of size $n$, the chance that the random interval
$$[\phi - k f/(2\sqrt{n}),\ \phi + k f/(2\sqrt{n})]$$
covers the population percentage $p$ is at least $1 - 1/k^2$, where $\phi$ is the sample percentage, $f$ is the finite population correction $\sqrt{(N-n)/(N-1)}$, $N$ is the population size, and $n$ is the sample size. For random sampling with replacement, the chance that the random interval
$$[\phi - k/(2\sqrt{n}),\ \phi + k/(2\sqrt{n})]$$
includes the population percentage $p$ is at least $1 - 1/k^2$. These are conservative procedures for constructing confidence intervals, because the probability that the intervals they produce cover the true population percentage $p$ (the actual coverage probability) is greater than the probability they claim, $1 - 1/k^2$ (the nominal coverage probability). These procedures can be extremely pessimistic, especially when the sample size $n$ is large and when the true population percentage $p$ is far from 50%: the intervals then are much wider than they need to be for the actual coverage probability to be $1 - 1/k^2$.
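As a rough illustration of the conservative procedure, the following Python sketch computes the interval from the sample percentage, the sample size, and the desired nominal coverage. The function and argument names are hypothetical. Percentages are expressed as fractions between 0 and 1, and $k$ is found by solving $1 - 1/k^2 = \text{confidence}$.

```python
from math import sqrt

def conservative_interval(phi, n, confidence, N=None):
    """Conservative confidence interval for a population percentage.

    phi: sample percentage, as a fraction between 0 and 1
    n: sample size
    confidence: nominal coverage 1 - 1/k^2, strictly between 0 and 1
    N: population size for a simple random sample;
       None means the sample was drawn with replacement
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    k = 1 / sqrt(1 - confidence)  # solves 1 - 1/k^2 = confidence
    # Finite population correction; 1 for sampling with replacement.
    f = 1.0 if N is None else sqrt((N - n) / (N - 1))
    half_width = k * f / (2 * sqrt(n))
    return phi - half_width, phi + half_width

# 75% nominal coverage corresponds to k = 2.
low, high = conservative_interval(0.5, 100, 0.75)
print(round(low, 3), round(high, 3))
```

The half-width does not depend on the sample percentage, which is one way to see why the intervals are wider than necessary when the population percentage is far from 50%.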
Suppose that the random sample is drawn with replacement. When the sample size $n$ is large, the central limit theorem ensures that the probability histogram of the sample percentage can be approximated accurately by the normal curve. The expected value of the sample percentage is $p$ and the SE of the sample percentage is $SD(box)/\sqrt{n}$, where $SD(box)$ is the population SD, $\sqrt{p(1-p)}$, the SD of the list of numbers on the tickets in the box. When $n$ is large, the SD of the sample, $s^*$, tends to be an accurate estimate of $SD(box)$, and the chance that the random interval
$$[\phi - z s^*/\sqrt{n},\ \phi + z s^*/\sqrt{n}]$$
contains $p$ is approximately equal to the area under the normal curve between $\pm z$. Taking $z = 1.96$, for example, gives approximate 95% confidence intervals. The coverage probability of this procedure typically is not exactly the area under the normal curve between $\pm z$, but as the sample size grows, the coverage probability approaches that area.
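The normal-approximation interval for a percentage can be sketched in Python as follows. This is an illustrative sketch, not the text's own implementation: the function name is hypothetical, the sample is assumed to be a list of 0s and 1s drawn with replacement, and the SD of such a list is computed as the square root of (fraction of 1s) times (fraction of 0s).

```python
from math import sqrt
from statistics import NormalDist

def approx_percentage_interval(sample, z=1.96):
    """Approximate confidence interval for a population percentage.

    sample: non-empty list of 0s and 1s drawn at random with replacement
    z: half-width of the interval, in estimated standard errors
    Returns (lower, upper, nominal_level), with percentages as fractions.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    phi = sum(sample) / n            # sample percentage
    s_star = sqrt(phi * (1 - phi))   # SD of a list of 0s and 1s
    half_width = z * s_star / sqrt(n)
    # Area under the normal curve between -z and +z.
    level = NormalDist().cdf(z) - NormalDist().cdf(-z)
    return phi - half_width, phi + half_width, level

# Hypothetical data: 60 ones and 40 zeros.
low, high, level = approx_percentage_interval([1] * 60 + [0] * 40)
print(round(low, 3), round(high, 3), round(level, 3))
```

The returned level is the nominal coverage; as noted above, the actual coverage probability only approaches it as the sample size grows.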
Approximate confidence intervals for the population mean can be constructed similarly, but then it is more common to use
$$s = s^* \sqrt{n/(n-1)}$$
to estimate $SD(box)$ than to use $s^*$. Let $M$ denote the sample mean. For random sampling with replacement, if the sample size $n$ is large, the chance that the random interval
$$[M - z s/\sqrt{n},\ M + z s/\sqrt{n}]$$
covers the population mean is approximately equal to the area under the normal curve between $\pm z$. Again, the coverage probability is not exactly the area under the normal curve between $\pm z$, but it approaches that area as the sample size grows.
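The interval for the mean can be sketched the same way. In this hypothetical Python example, $z$ is chosen so that the area under the normal curve between $\pm z$ equals the requested confidence level, and $s$ is obtained from $s^*$ by the rescaling given above. The data are made up and far too few for the normal approximation to be trusted; they only make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import NormalDist, fmean, pstdev

def approx_mean_interval(sample, confidence=0.95):
    """Approximate confidence interval for a population mean,
    for a large random sample drawn with replacement."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two data")
    M = fmean(sample)                # sample mean
    s_star = pstdev(sample)          # SD of the sample (divides by n)
    s = s_star * sqrt(n / (n - 1))   # sample standard deviation
    # z such that the area under the normal curve between -z and +z
    # equals the requested confidence level.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    half_width = z * s / sqrt(n)
    return M - half_width, M + half_width

# Hypothetical data with mean 5 and s* = 2.
low, high = approx_mean_interval([2, 4, 4, 4, 5, 5, 7, 9])
print(round(low, 2), round(high, 2))
```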
Confidence intervals can be constructed for population parameters other than percentages and means. For example, one can construct confidence intervals for percentiles of a population using the fact that for random sampling with replacement, the number of data that are less than the $100q$th percentile has a binomial distribution with parameters $n$ and $p \le q$, and the number of data that are greater than the $100q$th percentile has a binomial distribution with parameters $n$ and $p \le 1 - q$.
Key Terms
approximate confidence interval
bootstrap estimate of the standard deviation
Chebychev's inequality
confidence interval
confidence level
conservative confidence interval
coverage probability
expected value
finite population correction f
normal approximation
normal curve
parameter
population mean
population percentage
population SD
probability
sample mean
sample percentage
sample standard deviation s
standard deviation (SD)
standard deviation of the sample s*
© 1997-2015. P.B. Stark. All rights reserved. Last generated 5/29/2015, 8:04:49 PM. Content last modified 21 January 2013 08:37 PST.