Statistical Thinking. Based on C. J. Wild and M. Pfannkuch (1999). Statistical Thinking in Empirical Enquiry, International Statistical Review, 67(3):223-265. + Professor Matt Waite’s notes

Page 1: Statistical Thinking - Computer Science and Engineering

Statistical Thinking

Based on C. J. Wild and M. Pfannkuch (1999). Statistical Thinking in Empirical Enquiry, International Statistical Review, 67(3):223-265.

+ Professor Matt Waite’s notes

Page 2: Statistical Thinking - Computer Science and Engineering

Basic Ideas

• Thought processes involved in statistical problem solving
  • From problem formulation to conclusions

• A four-dimensional framework for statistical thinking in empirical enquiry
  • Investigative cycle
  • Interrogative cycle
  • Types of thinking
  • Dispositions

• Central element: “variation”

Page 3: Statistical Thinking - Computer Science and Engineering

Four-Dimensional Framework

Page 4: Statistical Thinking - Computer Science and Engineering

Dimension 1: The Investigative Cycle

• Concerned with abstracting and solving a statistical problem grounded in a larger “real” problem

• Based on the PPDAC model (Problem, Plan, Data, Analysis, Conclusions)

Page 5: Statistical Thinking - Computer Science and Engineering

Dimension 2: Types of Thinking

• Variation
  • Statistical thinking is concerned with learning and decision making under uncertainty, for the purposes of explanation, prediction, or control

Page 6: Statistical Thinking - Computer Science and Engineering

Dimension 2: More on Variation | Sources

Page 7: Statistical Thinking - Computer Science and Engineering

Dimension 2: More on Variation | Prediction, Explain, Control

Page 8: Statistical Thinking - Computer Science and Engineering

Dimension 2: Summary on Variation

• Special-cause vs. common-cause variation
  • Useful when looking for causes

• Explained vs. unexplained variation
  • Useful when exploring data & building a model for them

• Suppositions
  • Variation is an observable reality
  • Some variation can be explained; other variation cannot be explained on current knowledge
  • Random variation is the way in which statisticians model unexplained variation
  • This unexplained variation may in part or in whole be produced by the process of observation through random sampling
  • Randomness is a convenient human construct which is used to deal with variation in which patterns cannot be detected

Page 9: Statistical Thinking - Computer Science and Engineering

Correlation is NOT causation

Page 10: Statistical Thinking - Computer Science and Engineering

Dimension 3: The Interrogative Cycle

• Applies at macro levels

• Applies also at very detailed levels of thinking
  • Recursive
  • Subcycles are initiated within major cycles

Page 11: Statistical Thinking - Computer Science and Engineering

Dimension 4: Dispositions

• When authors become intensely interested in a problem or area, a heightened sensitivity and awareness develops towards information on the peripheries of our experience that might be related to the problem
  • People are most observant in areas they find most interesting

• Engagement intensifies each dispositional element

Page 12: Statistical Thinking - Computer Science and Engineering

Types of Analytics

• Descriptive
  • Describing characteristics or properties in the data

• Predictive
  • Predicting the types of outcomes given new sets of data, usually based on a classifier trained using labelled, existing datasets

• Prescriptive
  • Deciding on the best route or option or decision to make given data

Page 13: Statistical Thinking - Computer Science and Engineering

Types of Data

• Categorical (cf. Wikipedia)
  • Variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property
  • The blood type of a person: A, B, AB or O
  • The state that a person lives in
  • The political party that a voter might vote for
  • The type of a rock: igneous, sedimentary or metamorphic
  • Ordinal data?

• Numerical
  • Can be subdivided into discrete data (things that can be counted) and continuous data (any value within a range).
  • # of children, age, scores, temperatures, etc.

Page 14: Statistical Thinking - Computer Science and Engineering

Descriptive Statistics

• There are three main groups of descriptives
  • The distribution
    • Works well with categorical data. How many of each thing is there?
  • The central tendency
    • Only works with numerical data. What is the mean, median and mode?
  • The dispersion
    • Only works with numerical data. How spread out is the data?

Page 15: Statistical Thinking - Computer Science and Engineering

Descriptive Statistics: Distribution

• Grouping and counting by categorical data: group and count by town, or zip code, or something like that
  • Often called a frequency distribution
  • Histogram

• With numerical data, minimum and maximum values are useful
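The group-and-count idea can be sketched in a few lines of Python. This is a minimal illustration only: the town names and ages below are made-up values, not data from the slides.

```python
from collections import Counter

# Hypothetical records: the town each observation came from.
towns = ["Lincoln", "Omaha", "Lincoln", "Kearney", "Omaha", "Lincoln"]

# Group and count by category to get a frequency distribution.
frequency = Counter(towns)

# List categories from most to least frequent.
for town, count in frequency.most_common():
    print(f"{town}: {count}")

# For numerical data, the minimum and maximum bound the distribution.
ages = [23, 35, 41, 29, 35, 52]  # hypothetical values
print("min:", min(ages), "max:", max(ages))
```

A bar chart of `frequency` would be the graphical form of the same counts.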

Page 16: Statistical Thinking - Computer Science and Engineering

Descriptive Statistics: Central Tendency

• Mean
  • Average or norm: add up all values to find a total, and then divide the total by the number of values

• Median
  • Middle value: sort all values into order, and the median is the middle value; if there are 2 values in the middle, find the mean of these two

• Mode
  • Most frequent value: count how many times each value appears; the mode is the value that appears the most
  • Can have more than one mode
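As a small sketch of the three measures, Python's standard `statistics` module can be used. The `scores` list is a made-up example, and `multimode` requires Python 3.8 or later.

```python
import statistics

# Hypothetical scores, chosen so that there are two middle values and two modes.
scores = [2, 3, 3, 5, 7, 7]

# Mean: total of all values divided by the number of values (27 / 6).
mean = statistics.mean(scores)

# Median: middle of the sorted values; with two middle values (3 and 5),
# it is the mean of those two.
median = statistics.median(scores)

# Mode: most frequent value(s); multimode returns every value tied for first.
modes = statistics.multimode(scores)

print("mean:", mean)
print("median:", median)
print("modes:", modes)
```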

Page 18: Statistical Thinking - Computer Science and Engineering

Descriptive Statistics: Dispersion

• Range
  • Difference between the lowest and highest values
  • Subject to extremes (e.g., outliers)

• Standard deviation
  • Measures how far, typically, the scores in a set lie from the mean
  • Subject to skewness in distribution

• For a Gaussian/normal distribution
  • About 68% of all values will be within 1 standard deviation of the mean
  • About 95% will be within 2 standard deviations
  • About 99.7% will be within 3 standard deviations
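A brief sketch of both dispersion measures, plus a numerical check of how much of a normal distribution lies within 1, 2 and 3 standard deviations of the mean. The `values` list is illustrative only, and the population form of the standard deviation (`pstdev`) is an assumption; `statistics.stdev` would give the sample form.

```python
import statistics

# Hypothetical data set.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: difference between the highest and lowest values.
value_range = max(values) - min(values)

# Population standard deviation: square root of the mean squared
# deviation from the mean.
std_dev = statistics.pstdev(values)

print("range:", value_range)
print("standard deviation:", std_dev)

# Fraction of a standard normal distribution within k standard
# deviations of the mean, for k = 1, 2, 3.
standard_normal = statistics.NormalDist(mu=0.0, sigma=1.0)
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}
for k, fraction in coverage.items():
    print(f"within {k} standard deviation(s): {fraction:.4f}")
```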

Page 19: Statistical Thinking - Computer Science and Engineering

Dirty Data

• Missing data
  • Blanks in the database or spreadsheet.
  • Data missing from a period of time.
  • Missing states, counties, zip codes.

• Wrong data
  • Wrong type: numbers where they should be text and vice versa
  • Sharp curves: trends that continue normally and then suddenly jump in one year
  • Conflicting data within a dataset or across datasets (race, percentages, etc.)

• Unusable data
  • Non-standardized data
  • Inconsistent data
  • Abbreviations
  • Unit consistency
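A few of these checks can be sketched as a row-by-row scan. Everything here is hypothetical: the `rows`, the `VALID_STATES` set, and the `check_row` helper are invented for illustration, and real data would need checks tailored to its own columns.

```python
# Hypothetical rows as they might arrive from a spreadsheet export
# (every cell is text at this stage).
rows = [
    {"state": "NE", "population": "1200"},
    {"state": "Nebraska", "population": ""},
    {"state": "IA", "population": "three thousand"},
]

# Assumed standard: two-letter state abbreviations.
VALID_STATES = {"NE", "IA"}


def check_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    population = row["population"].strip()
    if not population:
        problems.append("missing population")        # missing data
    elif not population.isdigit():
        problems.append("population is not a number")  # wrong type
    if row["state"] not in VALID_STATES:
        problems.append("non-standard state")        # non-standardized data
    return problems


report = [check_row(row) for row in rows]
for row, problems in zip(rows, report):
    print(row, "->", problems or "ok")
```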

Page 20: Statistical Thinking - Computer Science and Engineering

Correlation

• Pearson correlation coefficients (or Pearson product-moment correlation coefficient)
  • It is a measure of how LINEARLY related two entities are.
  • How often is a change in A related to a change in B? And is that positive or negative?

Page 21: Statistical Thinking - Computer Science and Engineering

Correlation: For a population

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

Standard deviation of X; standard deviation of Y
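The formula shown on this slide did not survive extraction. For reference, the standard population definition of the Pearson coefficient is given below; the symbols $\rho$, $\mu$ and $\sigma$ are the conventional ones and are an assumption about the slide's notation.

```latex
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
           = \frac{\operatorname{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \, \sigma_Y}
```

Here $\mu_X$ and $\mu_Y$ are the population means, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.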

Page 22: Statistical Thinking - Computer Science and Engineering

Correlation: For a sample
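The slide's formula is not preserved in this transcript. The usual sample version, written with paired observations $(x_i, y_i)$ for $i = 1, \dots, n$ and sample means $\bar{x}$ and $\bar{y}$, is shown below; the notation is assumed rather than taken from the slide.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```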

Page 23: Statistical Thinking - Computer Science and Engineering

Correlation: What it means?

• It is based on a range from -1 to 1.

• 1 = perfect positive correlation
  • A goes up 1, B goes up 1
  • In the real world, almost never happens outside of a mistake

• 0 = no correlation at all
  • 0 rarely ever happens
  • NEAR zero happens all the time

• -1 = perfect negative correlation
  • A goes up 1, B goes down 1
  • It is just like 1: rare, probably a mistake
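To make the range concrete, here is a minimal sketch that computes the sample Pearson coefficient directly from its definition. The function name `pearson_r` and the example lists are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator of r).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator of r).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return co_deviation / math.sqrt(ss_x * ss_y)


a = [1, 2, 3, 4, 5]
r_up = pearson_r(a, [2, 4, 6, 8, 10])    # B rises in lockstep with A
r_down = pearson_r(a, [10, 8, 6, 4, 2])  # B falls in lockstep with A
r_weak = pearson_r(a, [3, 1, 4, 1, 5])   # noisy, only loosely related

print(r_up, r_down, r_weak)
```

The coefficient reaches 1 or -1 for any exactly linear relationship, whatever the slope, so B does not have to change by the same amount as A.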

Page 24: Statistical Thinking - Computer Science and Engineering

Significance: t-test

• The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.
  • A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known
  • When the scaling term is unknown and is replaced by an estimate based on the data, the test statistics (under certain conditions) follow a Student's t distribution
  • The t-test can be used, for example, to determine if two sets of data are significantly different from each other

https://en.wikipedia.org/wiki/Student%27s_t-test
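As one concrete instance of the two-sample case, the sketch below computes a t statistic using Welch's unequal-variance form. That choice is an assumption, since the slides do not say which variant is intended. The function name `welch_t` and the two samples are hypothetical, and turning the statistic into a p-value needs the t-distribution CDF, which Python's standard library does not provide.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Estimated squared standard error of each sample mean. The sample
    # variance plays the role of the unknown scaling term.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    difference = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = difference / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


group_a = [5, 6, 7, 8, 9]  # hypothetical measurements
group_b = [1, 2, 3, 4, 5]  # hypothetical measurements
t_stat, dof = welch_t(group_a, group_b)
print("t =", t_stat, "df =", dof)
```

A large absolute t relative to the t-distribution with `dof` degrees of freedom is evidence against the null hypothesis of equal means.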

Page 25: Statistical Thinking - Computer Science and Engineering

Significance: p-value & null hypothesis

• In the context of null hypothesis testing: to quantify the idea of statistical significance of evidence
  • In essence, a claim is assumed valid if its counter-claim is improbable

• The only hypothesis that needs to be specified in this test and which embodies the counter-claim is referred to as the null hypothesis
  • i.e., the hypothesis to be nullified

• A result is said to be statistically significant if it allows us to reject the null hypothesis
  • The statistically significant result should be highly improbable if the null hypothesis is assumed to be true

• The rejection of the null hypothesis implies that the correct hypothesis lies in the logical complement of the null hypothesis

• Caveat: Unless there is a single alternative to the null hypothesis, the rejection of null hypothesis does not tell us which of the alternatives might be the correct one
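The logic of "improbable if the null hypothesis is true" can be illustrated with an exact permutation test. This is a different technique from the t-test and is used here only because it needs no distribution tables. The null hypothesis is that group labels are irrelevant, so every way of splitting the pooled data into two groups is equally likely. The function name and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(sample_a, sample_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    observed = abs(mean(sample_a) - mean(sample_b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Enumerate every relabelling of the pooled values into two groups.
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        # Count relabellings at least as extreme as what was observed
        # (small tolerance guards against floating-point round-off).
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


group_a = [5, 6, 7, 8, 9]  # hypothetical measurements
group_b = [1, 2, 3, 4, 5]  # hypothetical measurements
p_value = permutation_p_value(group_a, group_b)
print("p =", p_value)
```

A small p-value says only that the observed difference would be rare if the labels did not matter; as the caveat above notes, it does not say which alternative explanation is correct.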

https://en.wikipedia.org/wiki/Student%27s_t-test