Evelina Gabasova - The mysterious correlation: a detective story - Codemotion Milan 2017

THE MYSTERIOUS CORRELATION

A detective story

Evelina Gabasová

@EVELGAB

CORRELATION IS NOT CAUSATION!

CORRELATION AND CAUSATION

EVERYONE LOVES A NICE CORRELATION!

CAUSAL LANGUAGE FOR CORRELATION DATA

THINGS TO CHECK
correlation or causation (randomised study?)
small sample size
result by chance
hidden cause (latent variable)
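To illustrate the "small sample size" and "result by chance" checks, here is a minimal sketch that draws pairs of completely unrelated variables and counts how often they still look strongly correlated. The sample sizes, the 0.6 threshold and the number of trials are illustrative assumptions, not figures from the talk.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def share_of_strong_correlations(sample_size, trials, threshold, rng):
    """Fraction of trials where two independent variables reach |r| > threshold."""
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]  # generated independently of xs
        if abs(pearson_r(xs, ys)) > threshold:
            hits += 1
    return hits / trials


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Assumed, illustrative settings: 8 vs 200 observations, threshold |r| > 0.6
small = share_of_strong_correlations(sample_size=8, trials=2000, threshold=0.6, rng=rng)
large = share_of_strong_correlations(sample_size=200, trials=2000, threshold=0.6, rng=rng)
print(f"n=8:   {small:.1%} of unrelated pairs show |r| > 0.6")
print(f"n=200: {large:.1%} of unrelated pairs show |r| > 0.6")
```

With only a handful of observations, a noticeable share of unrelated pairs clears the threshold; with a couple of hundred, essentially none do.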

DO STORKS DELIVER BABIES?

Source: Claude Covo-Farchi

Testing the hypothesis

Matthews, R. (2000), Storks Deliver Babies (p = 0.008). Teaching Statistics, 22: 36–38.

Storks and country area

Birth rate and country area
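A hidden cause can be sketched in a few lines. In the simulation below, country area drives both the stork count and the number of births, and the two never influence each other. All numbers are synthetic and the linear relationships are assumptions made for illustration; this is not the data from Matthews (2000).

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(1)
# Synthetic "countries": area is the latent variable behind both quantities (assumed model)
area = [rng.uniform(1, 10) for _ in range(2000)]
storks = [3 * a + rng.gauss(0, 1) for a in area]
births = [5 * a + rng.gauss(0, 1.5) for a in area]

raw_r = pearson_r(storks, births)
# Partial correlation: correlate what remains once area is accounted for
partial_r = pearson_r(residuals(storks, area), residuals(births, area))
print(f"storks vs births:               r = {raw_r:.3f}")
print(f"storks vs births, given area:   r = {partial_r:.3f}")
```

The raw correlation is strong, yet it vanishes once area is controlled for, which is the pattern to look for when a latent variable is suspected.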

Predicting salary with linear regression
Country
Years of programming experience
Tabs and spaces usage
Developer type and language
Level of formal education (e.g. bachelor’s, master’s, doctorate)
Whether they contribute to open source
Whether they program as a hobby
Company size

WHAT HAPPENS IF WE REMOVE TABS AND SPACES?

Diving deeper into linear regression
Full model: with the information on tabs and spaces included
Reduced model: without the information on tabs and spaces

Coefficient of determination: how much variance in salary can the model explain

Model          Coefficient of determination (R2)   Adjusted (R2adj)
Full model     0.4008                              0.3892
Reduced model  0.3938                              0.3892
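For reference, the adjusted coefficient of determination penalises R2 for the number of predictors. The sketch below implements the standard formula; the sample size and predictor counts in the usage lines are hypothetical placeholders, since the slides do not state them, so the results will not reproduce the table's adjusted values.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_obs - 1) / dof


# R2 values are from the slide; n_obs and n_predictors are HYPOTHETICAL placeholders
full = adjusted_r2(0.4008, n_obs=1000, n_predictors=20)
reduced = adjusted_r2(0.3938, n_obs=1000, n_predictors=19)
print(f"full model:    adjusted R2 = {full:.4f}")
print(f"reduced model: adjusted R2 = {reduced:.4f}")
```

Plain R2 can only go up when a predictor is added, so comparing a full and a reduced model on the adjusted value is the fairer check of whether the extra variable earns its place.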

COLLINEARITY?

WHAT CHANGED IN THE REDUCED MODEL?

MORE SIGNIFICANT IN THE REDUCED MODEL

Years of programming experience
Contributing to open source
PHP

OPEN SOURCE?

Open source contributors use spaces more than tabs

POTENTIAL EXPLANATION?

Language effects?

LANGUAGE AND OPEN SOURCE?

TABS, SPACES, OPEN SOURCE & SALARY

How does it fit together?

Exploring salary distributions
BASED ON EXPERIENCE LEVEL

What’s different for these users?
… more statistical testing

The importance of version control

                          Higher salary   Lower salary
Git                       168             660
I use some other system   17              30
Subversion                4               47
Team Foundation Server    6               92
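One way to check whether version control system and salary group are independent is a chi-squared test on the table above. The sketch below computes the statistic by hand from the slide's counts. The 5% significance level is an assumption, and this is not the test whose output appears later in the deck (that one has df = 18, so it was run on a different, larger table).

```python
# Counts from the slide: (higher salary, lower salary) per version control system
observed = {
    "Git": (168, 660),
    "I use some other system": (17, 30),
    "Subversion": (4, 47),
    "Team Foundation Server": (6, 92),
}


def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for count, col_total in zip(row, col_totals):
            # Expected count if row and column were independent
            expected = row_total * col_total / grand_total
            statistic += (count - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


statistic, dof = chi_squared_statistic(observed)
CRITICAL_5_PERCENT_DF3 = 7.815  # chi-squared critical value for df = 3 at the 5% level
print(f"X-squared = {statistic:.2f}, df = {dof}")
print("reject independence at 5%:", statistic > CRITICAL_5_PERCENT_DF3)
```

A significant result only says that the two variables are associated in this sample; it says nothing about which way any effect runs, or whether a third variable sits behind both.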

VERSION CONTROL AND TABS/SPACES

##
##  Pearson's Chi-squared test
##
## data:  .
## X-squared = 258.48, df = 18, p-value < 2.2e-16

Version control and salary

Git and Subversion

WHY IS VERSION CONTROL SO IMPORTANT?

github.com/evelinag/tabs-spaces-talk

BUT CAN WE TRUST THE DATA?

Salary distribution

WHAT’S WRONG?

MISSING DATA

Statistics of missing data
missing completely at random
missing at random
missing not at random
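The three mechanisms can be sketched with a simulation of how each one distorts a naive average salary. Everything here is synthetic: the salary model, the missingness probabilities and the 60000 cut-off are assumptions chosen for illustration, not properties of the survey data.

```python
import random

rng = random.Random(42)

# Synthetic respondents: (years of experience, salary), using an assumed linear model
people = []
for _ in range(20000):
    experience = rng.uniform(0, 20)
    salary = 30000 + 3000 * experience + rng.gauss(0, 10000)
    people.append((experience, salary))


def mean(values):
    return sum(values) / len(values)


true_mean = mean([salary for _, salary in people])

# Missing completely at random: every answer has the same 40% chance of being absent
mcar = [s for _, s in people if rng.random() > 0.4]

# Missing at random: absence depends on an OBSERVED variable (experience), not on salary itself
mar = [s for e, s in people if rng.random() > 0.8 * e / 20]

# Missing not at random: absence depends on the unobserved value (high earners stay silent)
mnar = [s for _, s in people if rng.random() > (0.8 if s > 60000 else 0.1)]

print(f"true mean salary:   {true_mean:,.0f}")
print(f"MCAR observed mean: {mean(mcar):,.0f}")
print(f"MAR observed mean:  {mean(mar):,.0f}")
print(f"MNAR observed mean: {mean(mnar):,.0f}")
```

Under MCAR the naive mean stays close to the truth. Under MAR it is biased, but the bias can in principle be corrected by conditioning on the observed variable. Under MNAR the data alone cannot reveal the bias.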

DATA TRAPS
missing data
wrong data
accidentally or deliberately

CAN YOU TRUST DATA?

INTERPRETATION
Machine learning as a service

EVELINA GABASOVA
Consulting data detective

@evelgab
evelinag.com