247

Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 2: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

BusinessstatisticswithExcelandTableauAhands-onguidewithscreencastsanddata

StephenPeplow

©2014-2015StephenPeplow

Page 3: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

WiththankstothestudentswhohavehelpedmeatImperialCollegeLondon,theUniversityofBritishColumbiaandKwantlenPolytechnicUniversity.

Page 4: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 5: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Contents

1.Howtousethisbook1

1.1Thisbookisalittledifferent 2

1.2Chapterdescriptions 2

2.VisualizationandTableau:telling(true)storieswith

data9

3.Writingupyourfindings15

3.1Planofattack.Followthesesteps. 17

3.2Presentingyourwork 17

4.Dataandhowtogetit18

4.1Bigdata 21

4.2Someusefulsites 21

5.Testingwhetherquantitiesarethesame23

5.1ANOVASingleFactor 23

5.2ANOVA:withmorethanonefactor 29

6.Regression:Predictingwithcontinuousvariables... 31

6.1Layoutofthechapter 33

6.2Introducingregression 33

6.3Truckingexample 36

6.4Howgoodisthemodel?—r-squared 39

6.5Predictingwiththemodel 40

6.6Howitworks:Least-squaresmethod 41

6.7Addinganothervariable 42

6.8Dummyvariables 43

6.9Severaldummyvariables 49

6.10Curvilinearity 50

6.11Interactions 52

6.12Themulticollinearityproblem 58

6.13Howtopickthebestmodel 58

6.14Thekeypoints 59

6.15Workedexamples 60

Page 6: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.Checkingyourregressionmodel61

7.1Statisticalsignificance 61

7.2Thestandarderrorofthemodel 64

7.3Testingtheleastsquaresassumptions 66

7.4Checkingtheresiduals 67

7.5Constructingastandardizedresidualsplot 68

7.6Correctingwhenanassumptionisviolated 71

7.7Lackoflinearity 71

7.8Whatelsecouldpossiblygowrong? 73

7.9Linearitycondition 73

7.10Correlationandcausation 74

7.11Omittedvariablebias 74

7.12Multicollinearity 75

7.13Don’textrapolate 75

8.TimeSeriesIntroductionandSmoothingMethods. . 76

8.1Layoutofthechapter 76

8.2TimeSeriesComponents 77

8.3Whichmethodtouse? 78

8.4Naiveforecastingandmeasuringerror 79

8.5Movingaverages 81

8.6Exponentiallyweightedmovingaverages 83

9.TimeSeriesRegressionMethods87

9.1Quantifyingalineartrendinatimeseries usingregression87

9.2Measuringseasonality 89

9.3Curvilineardata 91

10.Optimization95

10.1Howlinearprogrammingworks 95

10.2Settingupanoptimizationproblem 96

10.3Exampleofmodeldevelopment. 97

10.4Writingtheconstraintequations 97

10.5Writingtheobjectivefunction 98

Page 7: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.6OptimizationinExcel(withtheSolveradd-in)...98

10.7Sensitivityanalysis 100

10.8InfeasibilityandUnboundedness 102

10.9Workedexamples 102

11.Morecomplexoptimization107

11.1Proportionality 107

11.2Supplychainproblems 109

11.3Blendingproblems 110

12.Predictingitemsyoucancountonebyone114

12.1Predictingwiththebinomialdistribution 115

12.2PredictingwiththePoissondistribution 118

13.Choiceunderuncertainty119

13.1Influencediagrams 119

13.2Expectedmonetaryvalue 121

13.3Valueofperfectinformation 123

13.4Risk-returnRatio 124

13.5Minimaxandmaximin 124

13.6Workedexamples 125

14.Accountingforrisk-preferences127

14.1Outlineofthechapter 128

14.2Wheredotheutilitynumberscomefrom? 130

14.3Convertinganexpectedutilitynumberintoacer-

taintyequivalent. 136

15.Glossary137

Page 8: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 9: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

1.HowtousethisbookIbegan business life as an entrepreneur in business in HongKong. I ran tradeexhibitions,importedcoffeefromKenya,and startedandoperatedtworestaurants,oneofwhich(unusuallyforHongKong)servedvegetarianfood.Thesebusinesseswereprofitable,butIwouldhavesavedmyselfagreatdealofstress,anddonebetterif I had used some business intelligence informed by statistics to improve mydecision-making.Lookingback,IwishIhadbeenable tothinkthroughandwriteanalysesontopicssuchastheseamongmanyothers:

a.calculationofoptimalrestaurantstaffinglevels

b.analysisofsalesovertimeandseasonalityintrends

c.receiptspercustomerbyrestauranttypeandanalyzedstrengthandtypeofanydifferences

d.predictedsalesdataforpotentialexhibitorswithvisualiza-tionsofvarious‘whatif’scenarios

e.gaineddeeperinsightsfromvisualizingmydata

f.wonovermorepartnersandinvestorswithbettervisualandwrittenpresentations

I’msurethattherearemanymorepeoplelikeme,awarethattheyoughttobedoingmorewiththedatatheycollectaspartofnormalbusinessoperations,butuncertainofhowtogoaboutit.There is no shortage of textbooks andmanuals, but thesedon’tseem toget to thehands-onapplicationsquicklyenough.Thisbook is forpeoplesuchasme,intwocomponents:theanalysis,findingouttheunderlyingstoryfromthedata,andthenthepresentationofthestory.

1

Page 10: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 11: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 12: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

1.1Thisbookisalittledifferent

Mostbooksonstatsintroducestatisticaltechniquesonachapterbychapterbasis.Instead,I’vewrittenthisbooksothatitreflects thewaymany people learn. Thechaptersarestructuredbythetypeofquestionyoumightwanttoask.Examplesofthe typeofquestionsaresummarizedbelow,andatthebeginningof eachchapter.Thechaptersthemselvesdoesn’tincludemuchmathand technicaldetails. Insteadtheseareplacedtowardstheendofthebookinaglossary.

ManyoftheExcelproceduresarelinkedtoscreencastspreparedbymetoillustratethat particular procedure. The data-sets used in the book are available from myDropboxpublicfolder:justclickonthehyperlinkthatappearsnexttoeachworkedexample.Withthedata-setsopeninExcel,youcanfollowalongatyourownpace.(And then do the same with your own data). I have used Tableau for somevisualizations,andyouwillfindlinkstomy workbooksandscreencasts.

Page 13: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

1.2Chapterdescriptions

Ifyou’rereadingthisbook,thenyouareverylikelyalreadyengagedinbusinessandwouldliketoknowhowtotakethatbusinesstonewheights.Takealookthroughthechapterdescriptionsthat followandthengostraighttothatchapter.Ifyouthinkyoumightneedabitofastatisticsrefresher,lookthroughtheGlossaryinChapter16first.AverygoodfreestatisticsOpenIntrotextbookisavailablehere¹

Chapter2introducesTableauPublic²whichisafreedatavisu-alizationtool.WhileExceldoeshavegraphingtoolswhichareeasytouse,theresultscanlookalittleclunky.Tableauhelps

¹https://www.openintro.org/stat/²http://www.tableausoftware.com/public/

Page 14: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

ustomergedatafromdifferentsourcesandcreateremarkablevisualizationswhichcanthenbeeasilyshared.I’llbe illustratingresultsthroughoutthisbookwitheitherTableau,Excelorboth.One caution:ifyoupublishyourresultstothewebusingTableauPublic,yourdataisalsopublished.Ifthisisaproblem,thereisanoption topay foranenhancedversionofTableau.ThescreenshotbelowshowsworkdonechangesonlanduseintheDeltaregionofBritishColumbia.Byclickingonthelink,youcanopentheworkbookandalterthesettings.Youcanfiltertheyear (see topright)andalsolandusetype.ThankstoMalcolmLittleforhisworkonthisproject.TableauworkbookforDeltaLandCoverChanges³

Tableaushowingchangesinlandcover

Tableauhastraininganddemonstrationvideosavailableon itswebsite, and thereare plenty of examples out there. The screen casts which Tableau provides(available at their homepage) are probably enough.Where I have found sometechnique(suchasboxplots)particularlytrickyIhavecreatedscreencastsforthisbook.Theimagebelowisdatawewilluseintheregressionchapter.Youruna

³https://public.tableau.com/views/DeltaCropCategories1996-2011/Sheet1?:embed=y&:showTabs=y&:display_count=yes

Page 15: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

truckingcompanyandfromyourlogbooksextractthedistanceanddurationandnumberofdeliveriesforsomedeliveryjobs.Theimageshowsatrendline,withdistanceonthehorizontalaxisandtimetakenontheverticalaxis.Thepointsonthescatterplotare sizedbynumberofdeliveries.Youcanseethatmoredeliveriesincreasesthetime,asonewouldexpect.Usingregression,Chapter6,wewillworkoutamodelwhichcanshowhowmuchextrayoushouldchargeforeachdelivery.

HereisaYouTubefortruckingscatterplot⁴

Tableaushowingdistances,times,anddeliveries

Chapter3.Writingupyour findings. Inmostcases statistical analysis isdone inordertohelpadecision-makerdecidewhattodo.Iamassumingthatitisyourjobtothinkthroughtheproblemandassistthedecision-makerbyassemblingandanalyzingthedataonhis/herbehalf.Thischaptersuggestswaysinwhichyoumightwriteupyourreport,accompaniedbyvisualizationstogetacrossyourmessage.Therearealsolinkstosomehelpfulwebsiteswhich

⁴https://youtu.be/oFACLZLJWZI

Page 16: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

discussthepreparationofslidesandhowtomakepresentations.

Chapter4:Dataandhowtogetit.Ifyouareconsideringcollectingdatayourself,throughasurveyforexample,thenyou’llfindthischapteruseful.So-calledbigdataisahottopic,andsoIincludesomediscussion.Thechapteralsoincludeslinkstopubliclyavail-abledatasetswhichmightbehelpful.

Chapter 5 deals with tests for whether two or more quantities are the same ordifferent.Forexample,youareafranchiseewiththree coffee-shops.Youwant toknowwhetherdailysalesarethesameordifferent,andperhapswhatfactorscauseanydifference thatyou find. Youwant to know whether there is a statisticallysignificantcommonfactorornot,toeliminatethepossibilitythatthedifferenceyouseeoccurredpurelybychance.Perhapstheaverageageofthecustomersmakesadifferenceinthesales?ANOVAusesverysimilartheorytoregression,thesubjectofthenextchapterandperhapsthemostimportantinthebook.

Chapter 6: Regression is almost certainly the most important statis- tical toolcovered in thisbook. Inoneformoranother,regression isbehindagreatdealofapplied statistics. The example presented in this book imagines you running atrucking business and wanting to be able to provide quotations for jobs morequickly. It turnsout thatifyouhavesomepastlog-bookdatatouse,such as thedistance of various trips, the time that they took and howmany deliverieswereinvolved,anaccuratemodelcanbemade.Regressioncanfindtheaveragetimetakenforeveryextraunitofdistance,aswellas other variables such as the number ofstops. This chapter shows how to create such a model and how to use it forprediction.Withsuchamodel,youcanmakequotationsreallyquickly.Thischapteralsocoversmorecomplexregressionandhowtogoaboutmodel-building.

Otheruses:youwanttocalculatethebetaofastock,comparingthereturnsofoneparticularstockwiththeS&P500.

Chapter7isabouttestingwhethertheregressionmodelswebuilt

Page 17: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

inChapters6and7are in fact anygood.Regressionappears so easy to do thateverybody does it,without checking the validity of the answers. Following theprocedures in this chapterwill help to ensure that yourwork is taken seriously.Thischapterismoreofaguidetothinkingcriticallyabouttheresultsother peoplemight have. Is there missing variable bias? For example, you notice a strongcorrelationbetweensalesofice-creamsanddeathsbydrowning.Canyouthereforesaythatreducingice-creamsaleswillmakethewatersafer?Err–no.Themissingorlurkingvariable is temperature.Both drowning deaths and ice cream sales are afunctionoftemperature,notofeachother.

Chapter8isabouttime-seriesandshowshowwecandetecttrendsindataovertime,and make predictions. The smoothing methods, such as moving averages andexponentiallyweightedmovingaver-agescoveredinthischapterarefinefordatawhichlacksseasonality andwhenonlyarelativelyshort-termforecast is needed.Whenyouhavelonger-rundataandformakingpredictions,thefollowing chapterhasthetechniquesyouneed.

Chapter 9 describes the regression-based approach to time series analysis andforecasting.Thisapproachispowerfulwhenthereissomeseasonalitytoyourdata,forexamplesalesofTVsetsshowadistinctquarterlypattern(asdo umbrellas!).

Page 18: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

SalesofTVsetsshowingaquarterlyseasonality

Usingregressionwe candetect peaks and troughs, connect them to seasonsandcalculatetheirstrength.Theresultisamodelwhichwecanusetopredictsalesintothefuture.

Chapter 10 is about optimizationormaking thebest useof a limitednumberofresources.Thisishighlyusefulinmanybusinesssituations.Forexample,youneedto staff a factory with workers of different skills and pay-levels. There is aminimumamount of skills requiredforeachclassofworker,and perhaps also aminimumnumberofhoursforeachworker.Usingoptimization,youcancalculatethenumberofhourstoallocatetoeachworker.

Chapter11Optimizationcanalsobeusedformorecomplex‘blend- ing’problems.Example:Yourunapaperrecyclingbusiness. Youtakeinpapersandotherfiberssuchascardboardboxes.Howcanyoumixtogetherthevariousinputssothatyouroutputmeetsminimumqualityrequirements,minimizeswastage,andgeneratesthemostprofit?

Another example: what mix or blend of investments would best suit yourrequirements?

Chapter 12 concerns calculating the probability of items you can count one byone.Forexample,whatistheprobabilitythatmore

Page 19: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

thanfivepeoplewillcometotheservicedeskinthenexthalf-hour?What is theprobabilitythatallofthenextthreecustomerswillmakeapurchase?WecansolvetheseproblemsusingthebinomialandthePoisson distributions.

Chapter13concernschoiceunder uncertainty: if you have a choice of differentactions,eachofwhichhasanuncertainoutcome,whichactionshouldyouchoosetomaximize the expectedmonetary value?Afarmerknowsthepayoffshe/shewillmake from different crops provided the weather is in one state or another(sunny/wet) butatthetimewhenthecropdecisionhastobemade,heobviouslydoesn’t know what the actual state will be. Which crop should he plant? Amanufacturerneedstodecidewhethertoinvestinconstructinganewfactoryatatimeofeconomicuncertainty:whatshouldhedo?

Chapter 14adds thedecision-maker’s riskprofile tohisorher decisionprocess.Mostpeopleareriskaverse,andarewillingto tradeoffsomerisk inexchangeforcertainty. This chapter shows how to construct a utility curve which maps riskattitudes,andthenprioritizethedecisionsintermsofmaximumexpectedutility.

Chapter15isaGlossaryandcontainssomebasicstatisticsinforma- tion,primarilydefinitions of terms that the book uses frequently and which have a particularmeaninginstatistics(forexample‘population’).TheGlossaryalsodiscusseswhytheinferentialtech-niquesusedinstatisticsaresopowerful,allowingusto makeinferencesaboutapopulationbasedonwhatappearstobeaverysmallsample.

UnderEforExcelintheGlossary,you’llfindsomelinkstoscreencastsonsubjectsnotdirectlycoveredinthisbook,butwhichyoumightfindhelpful.

Page 20: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 21: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

2.VisualizationandTableau:telling(true)storieswithdata

In this book, we’ll work on gaining insights from data by visual- ization andquantitativeanalysis,andthenpresentingtheresultstoothers.RecentlyapowerfulnewtoolcalledTableauhasbecomeavailable.TableauPublic¹isafreeversion.Butbecarefulwithyourdatabecausewhenyoupublishtotheweb,asyoumustdowithTableauPublic,thenyourdataalsobecomesavailabletoanybody.Ifyourequirethatyourdatabeprotected,youcanpayfortheprivateversion.Tableau9hasjustbeenmadeavailable.

Many of the worked examples in this book are accompanied by a link to acompleted Tableau workbook, showing how you could have used Tableau topresent your findings. Tableau cannot perform easily the more technicalhypothesis-testingaspectsof statisticssuchasregression,butitcanhelpyoutogetyourpointacrossclearly.TableauhasalsoprovidedausefulWhitePaper²onvisualbestpracticeswhichiswellworthreading

In business intelligence we are generally interesting in detecting patterns andrelationships.Thismightseemobvious,becauseashumanswearealwayslookingforthesephenomena.I’djust liketoaddthattheabsenceofapatternorrelationshipmightbejustasinformativeasfindingone.Anexcellentfirststepistotakealookatyourdatawithanexpositorygraphbeforemovingontomoreformaldataanalysis.Belowareexamplesofthetypesofchartsorgraphsmostcommonlyusedineitherexaminingdatafirstoff,

¹http://www.tableausoftware.com/public/²http://www.tableausoftware.com/whitepapers/visual-analysis-everyone

9

Page 22: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

justto‘seehowitlooks’.Checkinganexpositorygraphshowsup problems thatmightthrowyou off later:missing data, outliers or,more excitingly, unexpectedandintriguingrelationships.

Histograms

Histogramsbreakthedataintoclassesandshowthedistributionofthedata.Whichclasses(sometimesknownasbins)aremostcommon.Thehistogramshowsusthe‘shape’ofthedata.Aretheremanysmallmeasurements,ordoesthedatalookasthough it is normally distributed and consequently mound-shaped. SeeDistributionintheGlossaryformoreon this.Theplotbelow isof deaths in caraccidentsonaweeklybasisintheUnitedStatesover theperiod1973-1978.Thehistogramshowsonlythedistributionandnotthesequence.

From the histogram only,wecould say thatthemostcommonnumberofdeathsinaweekis between8000and 9000, be- cause thebars of the histogram arehighest there. They arehigh- est because theycount thenumberof timesin the time-period thatthere have been (forexample 8000 deathsoccurredin15weeks).

Page 23: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

What the histogramdoesn’tshowis that

deathshaveactuallybeendecreasingsince

Histogramofcardeaths,reportedweekly

Page 24: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

about1976.Thereisprobablyasimpleexplanationforthis:compul-soryseatbeltsperhaps?Thetake-homefromthisisthatexpositorygraphsare reallyeasy todo,and that you should try different methodswith the same data to tease out theinsights.

Cardeathsovertime

Scatterplots

Themostcommonandmostusefulgraphisprobably the scatter- plot. It isusedwhenwehavetwocontinuousvariables,suchas twoquantities(weightofcarandmileage)andwewanttoseetherelationshipbetweenthem.Thescatterplotisalsousefulforplottingtimeseries,wheretimeisoneofthevariables.

YoucanuseExcelforthis,andalsoTableau.InExcel,arrangethecolumnsofyourdatasothatthevariablewhichyouwantonthehorizontalxaxis is theoneon theleft.Thisistheindependentvariable.Puttheothervariable,thedependentvariable,ontheverticalyaxis.Itisusualpracticetoputthevariablewhichwethinkisdoingtheexplainingonthexaxisandtheresponsevariableon

Page 25: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

theyaxis.Here isascatterplotofEuropeanUnionExpenditureper student as apercentageofGDP:

EU%ofGDPspentonPrimaryEducation

There is a gradual increase over the years 1996 to 2011 as one would expect;educational expenditures tend to remain a relatively fixed share of the budget.However, there was a blip in 2006 which is interesting. Because the data ispercentageofGDP, thedropmight have been caused by an increase inGDP oralternativelybyanactualeducationalspendingdecrease.We’dneedtolookatGDPfiguresfortheperiod.IgotEuropeanGDPfigurespercapitaandusingTableauputthemonthesameplot,withdifferentaxesofcourse.Theresultisbelow

Page 26: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

EuropeanGDPandPrimary Expenditures

GDPdroppedaround2008/2009becauseoftheworld financialcrisis.Theblipinprimaryexpendituresisstillunexplained.

Maps

Tableaumakesthedisplayofspatialdatarelativelyeasy,providedyoutellTableauwhichofyourdimensionscontainthegeographicalvariable.Theplotbelowshowschangesingreenhousegas emis-sionsfromagricultureintheEuropeanUnion.Ilinkedworldbankdatabycountry.Theworkbookishere³.

³https://public.tableausoftware.com/views/EuropeanGHGfromAgriculture2010-2011/Sheet1?:embed=y&:display_count=no

Page 27: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Unfortunately, not all geographic entities are available for linkage in Tableau.States and counties in theUnited States are certainly available and provinces inCanada,aswellascountriesinEurope.TherearewaystoimportArcGISshapefilesintoTableau.Ashape-fileisalistofverticeswhichdelimitthepolygonsthatdefinethegeographicalentity.

GHGemissionsfromagricultureintheEuropeanUnion

Page 28: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 29: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

3.WritingupyourfindingsInchapter1,ImentionedsomeofthequestionsIwishedIhadbeenabletoanswerwhenIwasoperatingabusiness.NowIwanttosuggestwaysinwhichyoumightstructureyourresponseto suchquestions.Theactualanswerscomewhenyou’vedonethestatisticalwork(carriedoutinthefollowingchapters),butitmighthelptohaveatleastastructureonwhichyoucanbeginbuildingyourwork.

Beforeyoudoanyworkontheproblem,getstraightinyourmindexactlywhatyouarebeingaskedtodo.Youneedtobeclearandsodoesthedecision-makertowhomyouaregoingtosubmityourfindings.Hereisonewaytodothis:

Copydownthekeyquestionthatyouarebeingaskedinexactlythewaythatithasbeengiventoyou.Forexample,letussaythatyouhavebeenaskedtoanswerthisadmittedly toughquestion: ‘What factorsareimportantfor the success of a newsupermarket?’

Nowkeeprewritingitinyourownwordssothatitbecomesas clearaspossible.Makesurethateachandeverywordisclear.Whatdoes‘success’actuallymean?—inthebusinesscontextprobably‘mostprofitable’.Whatdoes‘new’mean?Isthisacompletelynewsupermarketoranewoutletofanexistingbrand?Youmightendupwith:‘whenweareplanningalocationforanewoutletforourbrand,whatfactorscontributemosttoprofitability?’Doingallthishelpsyoutoidentifythestatisticalmethodmostsuitableforthetask.Youalsocanincludeherethehypotheses(ifany)thatyouwillbetesting.

Hereisasuggestedsectionorderforyourreport.However,thisprobablywon’tbetheorderofthework.TheorderinwhichyoudothestatisticalworkisinSection3,

yourplanofattack.Sotheorder

15

Page 30: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

ofdoingtheactualworkisthatyoucarryouttheplanofattack,and thenwriteupyourresultsfollowingthesectionbysectionsequenceIamgivingyounow.

Section1.Statethequestiontobeansweredclearlyandsuccinctly,asresultoftherewriting you did above.Doing thismakes sure that youclearlyunderstand theproblemandwhat is being askedof you. Writingout thequestion and stating itclearlyrightatthebeginning of your report also ensures that the decision-maker(thepersonwhoposedthequestioninthefirstplace)andyouunderstandeachother.Youcanalsowriteinthehypothesiswhichyouaretesting.

Section2.Provideanexecutivesummaryoftheresults.Thisshouldbeperhapssixlineslongandcontainthekeyfindingsfromyourwork.Don’tputinanytechnicalwords, jargonorcomplicatedresults.Justenoughsothatabusypersoncanskimthroughitandgetthegeneralideaofyourfindingsbeforegoingmoredeeplyintoyourhardwork.

Section3.Outlineyourplanofattack ,describinghowyouhave approachedtheproblem.Youalsocanincludeherethehypotheses(ifany)thatyouwillbetesting.Atypicalplanofattackfollowslateroninthischapter.

Section4.Data.Provideabriefdescriptionofthedatathatyouhaveused.Thesourceofthedata, howyou found it,whether you think it is reliable andwhether it issufficientforthetask.Itisagoodideato includesummarystatisticsat thispoint.This includes information such as the number of observations, and what thevariablesare.

Section5.Statisticalmethodology.Describethestatisticalmethod-ologywhichyouare going to use, for exampleANOVAto testwhetherornot themeansare thesame.

Section6.Carryoutthetests,makingsuretoincludeadescriptionofwhetherornotthehypothesescanberejected.Thisis thecentralpartofthereport.Justmakesuretofocusonansweringthequestion.

Page 31: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Section7.Writeaconclusionwhichdescribestheresultsof the test and how theresults answer the question that was asked. You can also include here anyshortcomingsinthedataorinyourmethodologywhichmightaffectthevalidityofthe results. The conclusion is very important because it ties together all theprevioussections.HereyoucouldincludealinktoaTableau‘story’asanadditionaloralternativewayofpresentingresults.

Section8.References/end notes or further information, especially on sources ofdata, should end the document. If youwant tomake your document look reallygood, and you have lots of references, consider using the Zotero plug-in forFirefox.Itisanexcellentfreewaytomanageyourbibliography.

Page 32: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

3.1Planofattack.Followthesesteps.

a.Identificationofthestatisticalmethodthatyouwilluseandwhyyouchosethosemethods

b.Whatdataisrequired,andwherecanitbefound?

c.Conductthestatisticaltests

d.Discusswhethertheresultsarereliable/answerthequestion.(Ifnot,startagain!)

Page 33: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

3.2Presentingyourwork

While it is probably best to concentrate on written workbecause the decision-makermightwant toreadyourworkindetail, and discuss it with colleagues, aTableau presentation is an excellent way ofmaking your point over again. SeeChapter2formoreonvisualizationandTableau.

Here is a link to some excellent notes by Professor Andrew Gelman on givingresearchpresentations¹

¹http://andrewgelman.com/2014/12/01/quick-tips-giving-research-presentations/

Page 34: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 35: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

4.DataandhowtogetitInthischapter,we’lllookatthedatacollectionprocess, theprob- lemsthatmightaccompanypoorlycollecteddata,discussso-calledbigdataandprovide links tosomeusefulpublicsourcesofdata.

As you might expect, this book works through the analysis of numbers —quantitativedata—butitisimportanttonotethattheanalysisofqualitativedataisarapidlygrowingarea.Unfortunately theanalysisofqualitativedataisbeyondthisbookandthesoftwareavailabletous.

Quantitativedatacomesfromtwomainsources. Primarydataiscollectedbyyouorthecompanyyouareworkingfor.Forexample,themarketresearchyoudotofindoutthepossiblemarketshareforyourproductprovidesprimarydata.Primarydataincludes bothinternalcompanydataanddatafromautomatedequipmentsuchaswebsitehits.Collectingdata is expensive and highly proprietary. It is thereforeunlikelytobepublishedandavailableoutsidetheenterprise.

By contrast, Secondary data is plentiful and mostly free. It is collected bygovernmentsofallsizes,andalsobymany non-governmentorganizationsaswell.There is an increasing trend towards the liberation of data under various opengovernmentinitiatives.Governmentdataisusuallyreliable,butbesuretochecktheaccompanyingnoteswhichwarnofanyproblems, suchas limitedsamplesizeorchangeinclassificationorcollectionmethods over time.Thereare some links tosecondarydataattheendofthischapter.

18

Page 36: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Experimentaldesign

Experimentsdon’tnecessarilyhave tobeconducted in laboratories bypeople inwhitecoats:asurveyinshoppingmallisalsoaformofexperiment,asisanalyzingthe results of your favorite cricket team. There are two different designtypes: experimental and observational . The keydifference is the amount ofrandomization thatisbuiltintotheexperiment.Ingeneral,morerandomizationisgood,becauseithelpstoremovetheinfluenceoffixedeffects,suchasthequalityofthesoilinaparticularlocation.

Here’sanexampletomakeclearthedifference.Imaginea re-searcherwantingtotest the effect of a new drug onmice. In an experimental design, the mice areexamined before the drug is administered, and then the drug is applied to atreatmentgroupofmiceandacontrolgroup.Thecontrolgroupreceivesnodrugs.Itsjobistoactasareferencegroup.Allthemicearekeptinidenticalconditionsapartfromtheapplicationofthedrug.Theallocationofmicetothetreatmentgroupandtothecontrolgroupisentirelyrandom.

Whilewe can (anddo) carryout drug experiments onmice, itwouldclearlybeverywrongtotrytodothesamewithhumansassubjects.Insteadwecollectdataorobservationsandthenanalyzethoseobservationslookingford i ffe rences .

Tocompensateforthelackofrandomization,wecontrolforob-serveddifferencesbyincludingasmanyrelevantvariablesaspossi-ble.IfweknewthatpersonXhadreceivedaparticulardrugandhaddevelopedaparticularcondition,wewouldwanttocomparepersonXwithsomebodyelsewhohadnotdeveloped that condition.Relevantvariablesthatwewouldwanttoknowmightbe age,genderandpossiblypre-existinghealthconditions.Includingthese variables reduces fixedeffectsandallowsustoconcentrateontheeffectofthedrug.

Page 37: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Problemswithdata

It is obvious that to be credible, your analyses must be based on reliable data.Problemswithdataareusuallyconnectedtopoorsamplingandexperimentaldesigntechniques,especially:

Samplesizetoosmall .Therelativesizeof thesample to thepopulationusuallydoesnotmatter.Itistheabsolutesizeofthesamplethatcounts.Youusuallywantatleastseveralhundredobservations.

Population of interest not clearly defined. It is clear that we need to take asamplefromapopulation,butwhatexactlyisthepopulation?Here’sanexample.Youwanttosurveyshoppersinashoppingmallregardingyournewproduct.Butofthepeopleinside themall,whoexactlyareyourpopulation?Peoplejustentering,peoplejustleaving,peoplehavingtheirlunchinthefoodcourt? Singles, couples,elderlypeopleortheteenagershanging aroundoutsidethedoor?Youcanseethatpickinganyoneofthesegroupsonitsownwillleadtoabiasedsample.

Non-response bias . Many people don’t answer those irritating telephone callswhich come in the evening because they’re busy withdinner.As a result, onlyanswers from those who do choose to answer the survey are counted. Thoserespondentsmostlikelyaren’trepresentativeof thepopulation.Perhaps they livealoneordonothavetoomuchtodo.I’mnotsayingthattheyshouldnotbe in thesample,justthatincludingonlythosewhodorespondmaybiasyoursample.

Voluntaryresponsebias .Ifyoufeelstronglyaboutanissue,then you aremorelikelytorespondthanifyouareindifferent.That’ssimplyhumannature.Asaresult,thesurveyresultswillbeskewedbytheviewsofthosewhofeelmostpassionately.Thisishardlyarepresentativesamplebecausethestrongly-heldviewsdrownoutthemoremoderate voices.

Page 38: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 39: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

4.1Bigdata

Primarydatafrequentlycomesfromautomatedcollectiondevices,suchasscanners,websites,socialmedia,andthelike.Thevolumeofsuchdataisenormous,andisaptlycalledbigdata.Bigdataisthetermusedtodescribelargedatasetsgeneratedbytraditionalbusi-nessactivitiesandfromnewsourcessuchassocialmedia.Typicalbig data includes information from store point-of-sale terminals, bank ATMs,FacebookpostsandYouTubevideos.

One of the apparently attractive features of big data is simply its size, whichsupposedlyenablesdeeperinsightsandrevealsconnectionswhichwouldnotappearinsmallersamples.Thisargu-mentneglectsthepowerofstatistics,andinparticularinferential statistics. A small sample, properly collected, can yield superiorinsightstoaverylargepoorlycollectedsample.Thinkofitthisway:whichisbetter:averylargesampleinwhichalltherespondentsareinthesameage-groupandofthesamegender;orasmalleronewhichmoreaccuratelyreflectsthepopulation?

Page 40: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

4.2Someusefulsites

YoucanofcourseeasilyjustGooglefordata,orlookatthesemorefocusedsites:

Gapminderdata:freetousebutbesuretoattribute¹Workeremploymentandcompensation²

Interestingandwide-ranginghistorical data³GooglePublicData⁴

¹http://www.gapminder.org/data/²http://www.bls.gov/fls/country/canada.htm³http://www.historicalstatistics.org/

⁴http://www.google.com/publicdata/directory

Page 41: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

InternationalMonetaryFund⁵

FoodandAgricultureOrganisation⁶UnitedNationsData⁷

TheWorldBank⁸

⁵http://www.imf.org/external/data.htm

⁶http://www.fao.org/statistics/en/

⁷http://data.un.org/

⁸http://databank.worldbank.org/data/home.aspx

Page 42: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 43: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

5.Testingwhetherquantitiesarethesame

This chapter concerns testing whether the population means of two or morequantitiesarethesameornot:andiftheyareinfactdifferent,whetheranyvariablecanbeidentifiedasbeingassociatedwith thedifference.The test we will use isANOVA,(AnalysisofVariance).ThetestwasdevelopedbytheBritishstatisticianSirRonaldFisher,andtheF-testwhichANOVAusesisnamedinhishonor.FisheralsodevelopedmuchofthetheoreticalworkbehindexperimentdesignduringhistimeattheRothamstedResearchStationinEngland.

Page 44: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

5.1ANOVASingleFactor

ThemoststraightforwardapplicationofANOVAiswhenwesimply want to testwhetherornottwoormoremeansarethesame.Inthisworkedexample,wehavethreedifferenttypesofwheatfertilizer(Wolfe,WhiteandKorosa)andwewouldliketoknowwhethertheirapplicationproducesequalordifferentyields.

AsthesectiononexperimentaldesigninChapter3emphasized,theexperimentmustbe designed to isolate the effect of the fertilizer, and this is achieved byrandomization. To control for differences in site-specific growing qualities, weselectplotsoflandwhichareassimilaraspossible:exposuretosunlight,drainage,slopeandother relevantqualities.Werandomlyassignfertilizerstoplots,andtheyieldsaremeasured.Thefirstfewlinesofatypicaldatasetappearbelow.

23

Page 45: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Thefertilizerdataset

There are two columns in the dataset: the factor , which is the name of thefertilizer,andtheyieldattributedtoeachplot.Inthiscase,eachfertilizerwastestedonfourplots,providing3x4=12observations.Later,whenusingExceltoruntheANOVA test, it will be necessary to change the format so that the yields aregroupedundereachfactor.

Agoodfirststepistovisualizethedata.HereisaboxplotdrawnwithTableau.

Page 46: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Fertilizer boxplot

Thedatasetishere:fertilizerdataset¹

andhereisaYoutubeofcreatingaboxplotinTableau².

¹https://dl.dropboxusercontent.com/u/23281950/fertilizer.xlsx ²http://youtu.be/3QohthWXp1M

Page 47: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

I’llmakeafrankadmissionrightnow:ittookmethebestpartofamorningtogetthetechniqueofdrawingaboxplotinTableaudown, sohopefully thisYouTubewillhelpyoutoavoidspendingsomuchtimeonthis.

The boxplot reveals these useful measurements: the smallest and largestobservation,ortherange.Themedian,whichisthehori-zontallinewithinthebox;andthe25%quartile(upperlineofthebox;and75%quartile,lowerlineofthebox.Therefore50%(75-25)of thedataarecontainedwithin the box. The ‘whiskers’markoffdatawhich are outliers,meaning that any datapointwhich is outside awhiskerisanoutlier.Thisdoesn’tmeantosaythatit issomehowwrong:perhapssomeofthemostinterestingdiscoveriescome fromlookingatoutliers.However,an outliermight also be the result of carelessdataentryandshouldthereforebechecked.Itisclearfromtheplotsthattheyieldsarebynomeansthesame.

Inthisexample,wearemeasuringonlyonefactor :theeffectofthefertilizeronthemeanyieldofeachplot,andsothetestwewanttoconductissinglefactorANOVAwith a completely randomized design. It is completely randomized because theallocationoffertil-izertoplotwasrandom.Wewantanydifferencestobeduetothefertilizerandthefertilizeralone.

Becauseourtestiswhetherthemeansarethesameordifferent,thehypothesisis:

Ho : µ 1= µ 2= µ 3= .. =µ n

withthealternativehypothesisthatnotallthemeansarethesame. That isn’t thesameassayingthattheyarealldifferent;justthatat leastoneisdifferentfromtheothers.

Therejectionrulestatesthatifthepvaluewhichcomesoutofthe testissmallerthan0.05,thenwerejectthenullhypothesis.Ifwe reject the null, thenwemustacceptthealternativehypothesis.

Page 48: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

WetestthehypothesisusingtheANOVASingleFactortoolwithinDataAnalysis.First,thetwo-columnstructureofthedata hastobetransformedintocolumnsforeachofthethreefertilizers.ThatiseasilyaccomplishedusingthePIVOTTABLEfunction.Thisyoutube³takesyouthroughtheprocessofchangingthestructureofthedataandrunningtheANOVAtest.Theresultarebelow.

ANOVAoutputforfertilizertest

Thekeystatisticisthep-value.Itismuchsmallerthan0.05andsowerejectthenullhypothesis.Themeansarenotallthesame.Theyaredifferent.ThesummaryoutputtellsusthatWhitehasthelargestmeanyieldandthisagreeswiththeboxplot.Inthiscase,separationofresultsintoarankingofyieldisrelativelyeasybecausetheyaresodistinct.UnfortunatelyExcellacksawayofeasilytestingwhetheranyotherpairarethesameordifferent.Allwecansayforsureisthattheyarenotallthesame.

Here is a slightlymore complicated and realistic example.You are designing anadvertisingcampaignandyouhavemodelswithdifferenteyecolors:blue,brown,andgreen,andalsooneshot in

³http://youtu.be/wevrSWYBl8U

Page 49: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

whichthemodelislookingdown.Youmeasuretheresponsetoeach arrangement.Doestheeyecoloraffecttheresponse?Thedatasetiscalledadcolor⁴.Thefirstfewlinesarehere:

Thefirstfewlinesoftheadcolordataset

Thecoloristhefactor,andwe’llneedtousePIVOTTABLEtotabulatethedatasothattheeyecolorbecomesthecolumns.I’vedonethathere….

Theresultis

Theeyecolorresults

⁴https://dl.dropboxusercontent.com/u/23281950/adcolour.xls

Page 50: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Thepvalueis0.036184,smallerthan0.05.Asaresultwecanstatethatdifferentcolors are associated with different responses. Looking at the summary output,greenhasthehighestaverageat3.85974,sowecouldprobablyselectgreen.

Page 51: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

5.2ANOVA:withmorethanonefactor

TheFertilizertestaboveshowedhowtotesttheeffectofasinglefactor.TheANOVAresultsshowedthatthemeanswerenot the same.By inspecting the boxplot andalsotheExceloutput, it isclearthatthevarietyWHITEhasahigheryield.Whatmight be interesting is finding whether another factor also has a statisticallysignificanteffectonyields,andwhetherthetwofactorsinteractedtogether.

ThecaseinquestioninvolvespreparationfortheGMAT,anexamrequiredbysomegraduate schools. The GMAT is a test of logical thinking and is therefore notdependentonspecificpriorlearning.Weknowthetestscoresofsomeapplicants,andwhetherthosestudentscamefromBusiness,EngineeringorArtsand Sciencesfaculties.Thestudentshadalsotakenpreparationcourses,ranging froma 3-hourreview to a 10-week course. The question is: did taking the preparation coursematter;anddidfacultymatter?Herewehavetwofactors:facultyandpreparationcourse.Thedataisalreadyarrangedincolumnsandsowecangostraightinwithatwo-waywithreplication.Lookatthedata⁵.Itisreplicatedbecause there are twosetsofobservationsforeachpreparationtype.Theoutputishere:

⁵https://dl.dropboxusercontent.com/u/23281950/testscores.xlsx

Page 52: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

TestscoresANOVAoutput

Wehavethepreparationtypeintherows,andtherelevantpvaluethereis0.358,sowefailtorejectthenull.Wecannotsaythatthereisanydifferenceinscoresasaresult of preparation course type; in other words, there is no effect on scoresresultingfrompreparationcoursetype.However,forfaculty(incolumns) there isdistinctdifference,withapvalueof0.004.Fromthesummary output, it looks asthoseintheEngineeringfacultyhadthehighestscore.

Page 53: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 54: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.Regression:Predictingwithcontinuousvariables

This chapter is about discovering the relationships that might or (equallywell)mightnotexistbetweenthevariables in thedataset ofinterest.Here’satypicalifrathersimplisticexampleof regression in action: the operator of some deliverytruckswantstopredictthe average time taken by a delivery truck to complete agivenroute.Theoperatorneedsthisinformationbecausehechargesbythehourandneedstobeabletoprovidequotationsrapidly.Thecustomerprovidesthedistancetobedriven,andthenumberofstopsenroute.Thetaskistodevelopamodelsothatthe truck operator can predict the time takengiven that information.Regressionprovides a mathematical ‘model’ or equation from which we can very quicklypredictjourneytimesgivenrelevantinformationsuchasthedistance,andnumberofstops.Thisisextremelyusefulwhenquotingforjobsorforauditpurposes.

Weknow the sizeof the inputvariables—thedistance andthedeliveries,butwedon’tknowtherateofchangebetweenthemandthedependentorresponsevariable:whatistheeffectontimeofincreasingdistancebyacertainamount?Oraddingonemorestop?Usingthetechniqueof regression ,wemakeuseofa setofhistoricalrecords,perhaps the truck’s log-book, tocalculate the average time takenby thetrucktocoveranygivendistance.Aswithanyattempttopredictthefuturebasedonthepast,thepredictionsfromregressiondependonunchangedconditions:nonewroadworks(whichmightspeedupordelaythejourney),nochangeintheskillsofthedriver.

31

Page 55: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

In the language of regression, the time taken in the truck example isthe response or dependent variable, and the distance to be driven is theexplanatory or independent variable. While we will only ever have just oneresponsevariable, thenumberof explanatoryvariablesisunlimited.Thenumberofstopsthetruckhastomakewillimpactjourneytime,aswillvariablessuchastheageofthetruck,weatherconditions,andwhetherthejourneyisurbanorrural.Ifweknow these variables, we also can include them in themodel and gain deeperinsightsintowhatdoes(andequallyimportantdoesnot)affectjourneytime.

Regressionmodelsareusedprimarilyfortwotasks:toexplainevents in thepast;and tomakepredictions.Regression providescoefficientswhichprovidethesizeanddirectionofthechangein theresponsevariableforaoneunitchangeinoneormoreoftheexplanatory variables.

Regressionmakesaprediction forhowlonga truckwill take to make a certainnumberof deliveries and also states how accurate thepredictionwillbe.Witharegression model in hand, we can make quotations accurately and quickly. Inaddition,wecandetect anomaliesinrecordsandreportsbecausewecancalculatehowlongajourneyshouldhavetakenundervariouswhat-ifconditions.

Here I have used a straightforward example of calculating a timeproblemforatrucking firm, but regression is much more powerful than that. Some form ofregressionunderliesagreatdealofappliedresearch.Whenyoureadaboutincreasedriskduetosmoking(orwhateverelseisthelatestdanger)thatriskwascalculatedwith regression. In this chapter,we’ll calculate the differencebetween usedandnewmarioKartsalesonebay,estimatestrokeriskfactors,andgenderinequalitiesinpay,allwithregression.

Regression isnotverydifficult todo,but theproblemis thateverybodydoes it.Ordinary Least Squares (OLS), linear regression’s proper name, rests on someassumptionswhichshouldbecheckedforvalidity,butoften aren’t.We cover theassumptions,andwhat

Page 56: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

todoifthey’renotmet,inthefollowingchapter.Thischapterisaboutthehands-onapplicationsofregression.

Page 57: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.1Layoutofthechapter

The chapter begins with some background on how regression works. We’llillustratethetheorywiththetruckingexamplemen-tionedabove,beforegoingontoaddingmoreexplanatoryvariablestoimprovetheaccuracyoftheprediction.

Particularly useful explanatory variables are dummy variables,which take on acategoricalvalues,typicallyzeroor1.Forexample,wecouldcodeemployeesasmaleandfemale,anddiscoverfromthiswhetherthereisagenderdifferenceinpay,andcalculatetheeffect. Inaworkedexample in the text,weanalyze the salesofmarioKartonebay,andusedummyvariablestofindtheaveragedifferenceincostbetweenanewandausedversion.

Sometimestwovariablesinteract:higherorlowerlevelsinonevariablechangetheeffectofanothervariable.Inthetext,weshowthattheeffectofadvertisingchangesasthelistpriceoftheitemincreases.

Finally,we’lldiscusscurvilinearity .Therelationshipbetweensalaryandexperienceisnon-linear.Aspeoplegetolderandmoreexpe-rienced,theirsalariesfirstmovequicklyupwards,andthenflattenoutorplateau.Wecancapturethatnon-linearfunctiontomakepredictionmore accurate.

Page 58: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.2Introducingregression

Atschoolyouprobablylearnedhowto calculate the slopeofa line as‘riseoverrun’.Let’ssayyouwanttogotoParisforavacation.Youhaveuptoaweek.Therearetwomainexpenses,theairfareand thecostofaccommodationpernight.Theairfareisfixedand

Page 59: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

staysthesamenomatterwhetheryougoforonenightorseven.Thehotel(plusyourmealchargesetc)is$200pernight.Youcouldwriteupasmalldatasetandgraphitlikethis:

TotalParistripexpenses

Theequationforis:Totalexpenses=1000+200*Nights

Basically,youhave justwrittenyour first regressionmodel. Themodelcontainstwocoefficientsofinterest:

• theinterceptwhichisthevalueoftheresponsevariable(cost)whentheexplanatoryvariableiszero.Heretheinterceptis

$1000.Thatistheexpensewithzeronights.Itisthepointon theverticalyaxiswherethetrendlinecutsthroughit,wherex=0.Youstillhavetopaytheairfareregardlessofwhetheryoustayzeronightsormore.

• theslopeof the linewhich is$200.Foreveryoneunit increase in theexplanatoryvariable(nights)thedependent variable (total cost) increasesbythisamount.

In regression, we usually know the dependent variable and the independentvariable, but we don’t know the coefficients.Weuse regressionto extract thosecoefficientsfromthedatasothatwecanmakepredictions.

Howdoweknowwhethertheslopeoftheregressionlinereflectsthedata?Trythisthoughtexperiment:ifyouplottedthedataand

Page 60: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

thetrendlinewasflat,whatwouldthatmean?Norelationship.You couldstay inParisforever,free!Aflatlinehaszeroslope,andsoanextranightwouldnotcostanymore.

More formally, thehypothesis testingprocedure tests thenullhypothesisthat theslopeiszero.Ifwecanrejectthenull,thenwearerequiredtoacceptthealternativehypothesis,whichisthattheslopeisnotzero.Ifitisn’tzero(thetrendlinecouldslopeupordown)thenwehavesomething toworkwith.We test the hypothesisusingthedatafoundfromthesample,andExcelgivesusapvalue.Ifthepvalueissmallerthan0.05,werejectthenullhypothesis.Ifwerejectthenullwecansaythatthereisaslopeandtheanalysisisworthwhile.

The Paris example had only one independent variable (number of nights) butregression can includemanymore, as the trucking examplebelowwillshow.Ifthereismorethanonevariable,wecannotshowtherelationshipwithonetrendlineintwodimen-sions.Instead,therelationshipisahyperplaneorsurface.Theplotbelow shows the effect on taste ratings (the dependent variable) of increasingamountsoflacticacidandH2Sincheesesamples.

Page 61: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

3Dhyperplaneforresponsestocheese

Hereisanotherexample.Wehavethedataonthesizeof thepopulationofsometowns and also pizza sales. How can we predict sales given population. PizzaYouTubehere¹

Page 62: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.3Truckingexample

The records of some journeys undertaken are available in an Excel file named‘trucking’².

Takealookatthedata.AsImentionedatthebeginningofthebook, itisalwaysagoodideatobeginwithatleastasimpleexploratoryplotofthevariablesofinterestinyourdata.WithExcelwecanquicklydrawaroughplotsothatwecanseewhatisgoingon.Whatwewanttodoispredicttimewhenweknowthedistance.The

¹https://www.youtube.com/watch?v=ib1BfdVxcaQ²https://dl.dropboxusercontent.com/u/23281950/trucking.xlsx

Page 63: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

scatter plot below shows time as a function of distance.YouTube: scatterplot oftrucking data³

Trucking scatterplot

Time is the dependent variable and distance is the independent variable. Theindependentvariablegoesonthehorizontalx axis,thedependentvariableontheverticalyaxis.

Fromtheplotitisclearthatthereisapositiverelationshipbetweenthetwovariables.Asthedistanceincreases,thensodoesthetime.Thisishardlyasurprise.Butwewantmorethanthis:wewanttoquantifytherelationshipbetweenthetimetakenandthedistance traveled.Ifwecanmodel this relationship thatwill be usefulwhencustomersaskfor quotations.

Thedatasetcontainsonefurthervariable,whichisthenumberofdeliveriesontheroute.We’llusethatlateron.Fornowwe’llusejustthetimeandthedistance.

Tobuildourmodel,wewanttobuildanequationwhichlookslikethisinsymbolicform

³https://www.youtube.com/watch?v=HSpY12mLXoU

Page 64: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

y = b o + b 1 x 1+…b n x n +e

yhat(spoken‘yhat’) is the dependent variable. It is called ‘hat’ because it isanestimate. b0 is the intercept, or the point where the trend line (also called theregression line) passes through the vertical y axis. This is the point where theindependentvariableiszero.Youperhapsrememberasimilarequationfromschool:y=mx

+b.

Inthetruckingexample,theintercept(b0)mightbethe timetakenwarmingupthetruckandcheckingdocumentation. Timeis running,but the truck isnotmoving.b1 is the ‘coefficient’fortheindependentvariablebecauseitprovidesthechangeinthedependentvariableforaone-unitchangeinthe independentvariable.x1istheindependentvariable,inthiscasedistance.Wewanttobeabletopluginsomevalueoftheindependentvariableandgetbackapredictedtimeforthatdistance.

Thebandthexbothhaveasubscriptof1becausetheyarethefirst(andsofaronlyindependentvariable).Ihaveputinmorevariables just to indicate thatwe couldhavemany.

Thecoefficient,b1,providestherateofchangeofthedependentvariableforaoneunitchangeintheindependentvariable.Youcanthinkofthisastheslopeoftheline:asteeperslopemeansagreaterincreaseinyforaone-unitincreaseinx.

WecannowfindtheestimatedregressionequationusingtheregressionapplicationinExcel.First,weusetheregressfunctioninExcel’sdataanalysistooltoregressdistanceontime.YouTube⁴

Theregressionoutputisasbelow:

⁴https://www.youtube.com/watch?v=xKsYfa7YGgE

Page 65: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Theregressionoutput

Ihavemarkedsomeofthekeyresultsinred,andIexplainthembelow.

Page 66: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.4Howgoodisthemodel?—r-squared

Thereisaredcirclearoundtheadjustedr-squaredvalue,here0.622.r-squaredisameasureofhowgoodthemodelis.r-squaredgoesfrom0to1,witha1meaningaperfectmodel, which is extremelyrareinappliedworksuchasthis.Zeromeansthereisnorelationshipatall.Theresulthereof0.62meansthat62%ofthevariancein the dependent variable is explained by themodel. This isn’t bad, but it’s notgreat either. Below,we’ll add more independent variables and show how theaccuracy of the model improves.Theadjusted r-squared takes into account thenumberofvariablesintheregressionequationandalsothesamplesize,whichiswhyit is slightly smaller than the r-squared value. It doesn’t have quite the sameinterpretationasr-squared,butitisveryusefulwhen comparing models. Like r-squared,wewantashighavalueaspossible:higherisbetterbecauseitmeansthatthemodelisdoingamoreaccuratejob.

Alsocircledarethewordsinterceptanddistance.Thesegiveusthe

Page 67: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

coefficientsweneedtowritethemodel.ExtractingthecoefficientsfromtheExceloutput,wecanwritetheestimatedregressionmodel.

y =1 . 274+0 . 068∗Dist

whereyhatisthe estimated or predicted response time. If you took avery largenumberofjourneysofthesamedistance,thiswouldbetheaverageofthetimetaken.

Theinterceptisaconstant,itdoesn’tchange.Itistheamountoftimetakenbeforeasinglekilometerhasbeendriven.Thedistancecoefficientof0.068 is thepieceofinformationthatwe reallywant.Thisistherateofchangeoftimeforaoneunitchange in the dependent variable, distance. An increase of one kilometer indistanceincreasesthepredictedtimeby0.0678hours,and vice-versaofcourse.Dothemathandyou’llseethattheaveragespeedis14.7mph.

Page 68: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.5Predictingwiththemodel

Amodel such as this makes estimating and quoting for jobs much easier. If amanagerwantedtoknowhowlongitwouldtakeforatrucktomakeajourneyof2.5kmsforexample,allheorshewouldneedtodoistoplug2.5intodistanceandget:

predictedtime=1.27391+0.06783*2.5=1.4435hours.

multiplythisresultbythecostperhourofthedriver,addthemiscellaneouschargesandyou’redone.

Itisunwisetheextrapolate.Onlymakepredictionswithintherangeofthedatawithwhichyoucalculatedyourmodel(seealsoChapter4onthis.

Page 69: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 70: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.6Howitworks:Least-squaresmethod

Themethod we just used to find the coefficients is called themethodof leastsquares .Itworks like this: the software tries to find a straight line through thepointsthatminimizestheverticaldistancebetweenthepredictedyvalueforeachx(thevalueprovidedbythetrendline)andtheactualorobservedvalueofeachyforthatx.Ittriestogoascloseas itcan toallof thepoints.The verticaldistanceissquaredsothattheamountsarealwayspositive.

Theplotbelowisthesameastheoneabove,exceptthatIhaveaddedthetrendline.

Theblacklineillustratestheerror

Theslopeofthetrendlineisthecoefficient0.0678.Thatistherateofchangeofthedependentvariableforaone-unitchangeinthe independent variable. Think “riseoverrun”.Iextendedthetrendlinebackwardstoillustratethemeaningofintercept.Thevalueoftheinterceptis1.27391,whichisthevalueofywhenxiszero.Thisisactually a form of extrapolation, and because we have no observations of zerodistance,thisisanunreliableestimate.

Nowlookatthepointwherex=100.Thereareseveralyvalues representingtimetakenforthisdistance.Ihavedrawnablackline

Page 71: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

betweenoneparticularyvalueandthetrendline.Thisverticaldistancerepresentsanerrorinprediction:ifthemodelwasperfect, all theobservedpointswould liealongthetrendline.Themethodofleastsquaresworksbyminimizingtheverticaldistance. Itispossibletodothecalculationsbyhand,buttheyaretediousandmostpeopleusesoftwareforpracticaluse.Thegapbetweenpredictedandobservedisknownasaresidualanditisthe‘e’terminthegeneralformequationabove.

Theerrordiscussedjustaboveistheverticaldistancebetweenthepredictedandtheobserved values for every x value. The amount of error is indicated by the r-squaredvalue .r-squaredrunsfrom0(acompletelyuselessmodel)to1(perfectfit).

Intheglossary,underRegression,Ihavewrittenupthemaththatunderpins theseresults.

Page 72: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.7Addinganothervariable

Ther-squaredof0.66wefoundwithoneindependentvariableisreasonableinsuchcircumstances, indicating that our model explains 66% of the variability in theresponsevariable.Butwemight dobetter by adding another variable to explainmoreofthevariability.

Thetruckingdatasetalsoprovidesthenumberofdeliveries that the driver has tomake.Clearly,thesewillhaveaneffectonthetimetaken.Let’sadddeliveriestotheregressionmodel.

Notethatyou’llneedtocutandpastesothattheexplanatoryvariablesareadjacenttoeachother.Itdoesn’tmatterwheretheresponsevariable is,but theexplanatoryvariablesmustbe adjacentinoneblock.Thenewresultisbelow:

Page 73: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Nowwithdeliveries

Notethatther-squaredhasincreasedto0.903, so the newmodel explains about25%moreofthevariationintime.Theimprovedmodelis:

y =− 0 . 869+0 . 06∗Dist+0 . 92∗Del

A few things to note here: the values of the coefficients have changed. This isbecausetheinterpretationofthecoefficientsinamultiplemodellikethisisbasedononlyonevariablechanging .Forexample, thecoefficientofdistanceis0.92,almostonehourforeachdeliveryassumingthatthedistancedoesn’talsochange.Weshouldinterpretthecoefficientsundertheassumptionthatalltheothervariablesare‘heldsteady’,apartfromthecoefficientofinterest.

Page 74: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.8Dummyvariables

Above, we saw how adding a further variable has dramatically improve theaccuracyofamodel.Adummyvariableisanadditional variablebutone thatweconstructourselvesasa resultofdividing data into two classes, for example bygender.Dummyvariablesare

Page 75: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

powerful because they allow us to measure the effect of a binaryvariable, known as a dummy or sometimes indicator variable. Adummyvariabletakesonavalueofzeroorone,andthuspartitionsthedataintotwoclasses.Oneclassiscodedwithazero,andiscalledthereference group . The other classes are coded with a one andsuccessive numbers.

Theremay bemore than two groups, but therewill always be onereference group. W eare generating an extra variable, so theregressionequationlookslike this:

y =b 0+b 1 x 1+b 2 x 2

In the case of observations which have been coded x1=0 (thereference group), then the b1 term will disappear because it ismultipliedbyzero.Theb2termremains.Forthereferencegroup,theestimatedregressionequationthensimplifiesto

y ˆ reference=b 0+b 2 x 2

whileforthenon-referencegroup,itis

y ˆ nonreference=b 0+b 1 x 1+b 2 x 2

Thesizeofb1x1represents thedifferencebetweenthe averagesizeofthereferencelevelandwhatevergroupisthenon-referencegroup.Let’sworkthroughanexample.Creatingadummyvariable⁵

Thedataset[‘gender’](https://dl.dropboxusercontent.com/u/23281950/gender.xlsx)containsrecordsofsalariespaid,yearsofexperienceandgender.

Wemightwantto knowwhethermen andwomen receive the samesalarygiventhesameyearsofexperience.Loadthedata,then run alinear regression of Salary on Years of Experience, using years ofexperienceasthesoleexplanatoryvariable.Theresultisbelow:

⁵https://www.youtube.com/watch?v=TBJsEb2UCPs

Page 76: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Thegenderregression

The interpretation is that the interceptof$30949 is the aver- age starting salary,withyearsofexperiencezero.Thecoefficient

$4076meansthateveryadditionalyearofexperienceincreasestheworker’ssalarybythisamount.

Ther-squaredis0.83,sothesimpleone-variablemodel explainsabout83%ofthevariation in the dependent variable, salary. We have one more variable in thedataset,gender.Addthisasadummyvariableandruntheregressionagain.Notethatyouwillhavetocutandpaste thevariablesso that theexplanatoryvariables areadjacent.

Page 77: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Withthegenderdummyadded

Ther-squaredhasincreasedtonearly1,soournewmodelwiththeinclusionofgenderisveryaccurate.Theimportantnewvariableisgender,codedasfemale=0andmale=1.Nothingsexistaboutthis,wecouldequallywellhavereversedthecoding.Ifyouarefemale,thenthemodelforyoursalaryis:23607.5+4076*Years

ifyouaremale,thenthemodelforyoursalaryis23607.5+114683.67+ 4076Years

Eachextrayearofexperienceprovidesthesamesalaryincrease,butonaveragemalesreceive$14683.67moreearnings.

Hereisanotherexample,usingthemaintenancedataset.Dummyvariable⁶

Anotherdummyvariableexampleandacautionarytale

Themariokartdatasetcamefrom[OpenIntroStatistics](www.openintro.org)awonderfulfreetextbookforentrylevelstudents.Thedataset

⁶https://www.youtube.com/watch?v=Yv681upodDI

Page 78: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

contains informationon thepriceofMarioKart in ebay auctions.First let’s testwhethercondition‘new’or‘used’makesdifference.ConstructanewcolumncalledCONDUMMY,codednew=1andused=0.Nowrunaregressionwithyournewdummyvariableagainsttotalprice.Theoutputisbelow

Theconditiondummyresults

Thisresultistremendouslybad.Thepvalueis0.129,meaningthatconditionisnotastatistically significant predictor of price. Surely the conditionmust have somesignificant effect?Wait—we forgot to do some visualization. To the right is ahistogramofthetotalpricevariable.

Lookslikewemighthaveaproblemwithoutliers…someof the observations aremuchlargerthantheothers.Takeanotherlookwithabarchartatthehigherprices.

Page 79: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

marioKarttotalprices

Wecanidentifytheseoutliersusingzscores(seetheGlossary).Orwecouldjustsortthembysizeandthenmake

avaluejudgementbasedonthedescription.That’swhatIhavedoneinthisYouTube.Outlierremovaland regres-sion⁷

Thetotalpricesasabarchart

It seems that two of the items listed were for grouped items which were quitedifferentfromtheothers.Thereisthereforealegitimate reasonforexcludingthem.Belowarethenewresults:

⁷https://www.youtube.com/watch?v=ivLGteHuu3Q

Page 80: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

CorrectedmarioKart output

Theestimatedregressionequation is

y =42 . 87+10 . 89∗CONDUMMY

Theconditiondummywascodedasnew=1,used=0.Ifamariokartisused,thenitsaveragepriceis42.87,ifitnewthentheaveragepriceis42.87+10.89.Theaveragedifferenceinpricebetweenoldandnewisnearly$11.Makessense.

Take-home:checkyourdatabeforedoingtheregression.

Page 81: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.9Severaldummyvariables

ThegenderandthemarioKartexamplesabovecontainedjustonedummyvariable.But it is possible to havemore. For example, your sales territory contains fourdistinctregions.Ifyoumakeoneofthefourthereferencelevel,andthendivideupthedatawithdummies

Page 82: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

for the remaining three regions, you can compare performance in each of theregionsbothtoeachotherandtothereferencelevel.

Thedatasetmaintenancecontainstwovariableswhichyoucanconverttodummies,followingthisYouTube.

MaintenanceRegression⁸

Page 83: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.10Curvilinearity

Sofar,wehaveassumedthattherelationshipbetweenthedepen-dentvariableandtheindependentvariablewaslinear. However,inmanysituationsthisassumptiondoes not hold. The plot below showsethanolproduction inNorthAmerica overtime.

NorthAmericanethanolproduction.Source:BP

ThesourceofthedataisBP.Itisclearthatproductionofethanolisincreasingyearlybutinanon-linearfashion.Justdrawingastraightlinethroughthedatawillmisstheincreasingrateof production.Wecancapturetheincreasingratewithaquadraticterm,which is simply the time element squared. I have created a new variablewhichindexestheyears,whichIhavecalledt,andafurthervariable

⁸https://www.youtube.com/watch?v=xWdcT7u9YFE&feature=em-upload_owner

Page 84: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

whichistsquared.

urvilinearregressionYouTube⁹

Thedatasetwithanindexfortime.Source:BP

Aregressionoftagainstoutputhasanr-squaredvalueof0.797.Inclusionofthetsquaredtermincreasesther-squaredmarkedlyto

0.98.Theregressionoutputisbelow.

Regressionoutputwiththequadraticterm

Wewouldwritetheestimatedregressionequationas

y =4504−1301 t+194 . 9 t 2

Noticethatforearlysmallervaluesoft,theeffectofthequadratic

⁹https://www.youtube.com/watch?v=jgjWSpyPBqg

Page 85: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

termisnegligible.Astgetslargerthenthequadraticswampsthelineartterm.

Makingaprediction .Let’spredictethanolproductionafter5years.Then

y =4505−1301∗5+194 . 9∗25=2872 . 5

Theplotbelowshowspredictedagainstobservedvalues.While the fit is clearlyimperfect,itiscertainlybetterthanastraightline.

Predictedagainstobservedethanolproduction.

Page 86: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.11Interactions

Increasingthepriceofagoodusuallyreducessalesvolume(al- thoughof courseprofitmightnotchangeifthepriceincreasessufficientlytooffsetthelossofsales.Advertisingalsousuallyincreasessales,otherwisewhywouldwedoit?

What about the joint effect of the two variables? How about reducingthepriceandincreasing theadvertising?The jointeffect is called an interaction and caneasilybeincludedintheexplanatoryvariablesasan extra term.Theoutput belowistheresultof

Page 87: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

regressingsalesonPriceandAdsforaluxurytoiletriescompany.Theestimatedregressionmodel is

y =864−281 P+4 . 48 Ads

Regressionoutputforjustpriceandads

Sofarsogood.Thesignsofthecoefficientsareaswewouldexpectfromeconomictheory.Salesgodownaspricerises(negativesignonthecoefficient).Salesgoupwithadvertising(positivesign onthecoefficient).Caution:donotpaytoomuchattention to the absolute size of the coefficients.The fact that price has a muchlargercoefficientthanadvertisingisirrelevant.Thecoefficientisalsorelatedtothechoiceofunitsused.

Nowcreateanothertermwhichispricemultipliedbyads.Callthis termPAD.Thefirstfewlinesofthedatasetarebelow.

Page 88: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Firstfewlineswiththenewinteractionvariable

Nowdotheregressionagain,includingthenewterm.

Inclusionoftheinteractionterm

The new term is statistically significant and the r-squared has increased to0.978109.Thenewmodeldoesabetterjobofexplainingthevariability.Howcomepriceisnowpositiveandtheinteractionterm is negative?We have to look at theresults as a whole rememberingthatthesignsworkwhenalltheothertermsare‘heldconstant’.Theexplanation:asthepriceincreases,theeffectofadvertisingonsalesisLESS.Youmightwanttolowertheprice andseeiftheincreasedvolumecompensates.

Page 89: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Anotherinteractionexample

Here’sanotherexample,thisonerelatingtogenderandpay. Thedatasetiscalled‘paygender’ and contains information on the gen- der of the employee, his/herreview score (performance), years of experience and pay increase. We want toknow:

•Isthereagenderbiasinawardingsalaryincreases?

•Inthereagenderbiasinawardingsalaryincreasesbasedontheinteractionbetweengenderandreviewscore?

First,let’sregresssalaryincreasesonthedummyvariableofgenderandalsoReview.Myresultsarebelow(IhavecreatedanewvariablewhichIhavecalledG.Itisjustthegendervariablecodedwithmale

=0andfemale=1.

GdummyandReview

Nothingverysurprisinghere:womengetpaidonaverage233.286 lessthanmen.And—holdinggendersteady–eachpoint increase in Reviewgives an increase insalaryof2.24689.

How do I know thatwomen are paid less than men? The estimated regressionequationis:

Page 90: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

y =204 . 5061 − 233 . 286 ∗ ( x=1 ifF )−233 . 286∗( x=0ifM )+2 . 24689∗Review

Rememberhowwecodedmenandwomen?Thex=0ifMtermdisappears,soweareleftwiththisequationformen:

y =204 . 5061+2 . 24689∗Review

andthisoneforwomen(I’vedonethesubtraction)togive

y =− 28 . 779+2 . 24689∗Review

Conclusion:thereisagenderbiasagainstwomen.ForthesameReviewstandard,onaveragewomenarepaid233.286lessthanmen.

How about the second question — possibility that the gender biasincreases with Review level? Wecan test this with an interactionvariable.MultiplytogetheryourgenderdummyandtheReviewscoretocreateanewvariablecalledInteract.Thendothe regression againwith salary regressed against G, Review and Interact. My output isbelow.

Page 91: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Interactionoutput

Formen,theequationisSalary=59.94472+4.848995(becausetheGandInteracttermsgotozerobecausewecodedMale=0.SoforeveryextraincreaseinReview,aman’ssalaryincreasesby4.848995.

Forwomen,theequationisSalary=59.94472-29.714-1*(4.848995-4.05468)soforeveryone increase inReview,awomen’ssalary increasesbyonly 4.848995-4.05468=0.79. Notice that the adjusted r- squared for thismodel is higher (it is0.927339comparedto0.810309) thanthemodelwithjustGandReview.So theinteractionisbothstatisticallysignificant,pvalue0.000316)andhasthemeaningthatforwomen,improvingtheReviewscoredoesn’tincreasethesalaryasmuchasformen.

ConclusionTheanalysisshowedthatwomenarebeingtreatedunfairly.Thereisagenderbiasagainsttheminaveragesalariesfor thesameperformancereview;andeach increase in reviewpoints earn themconsiderably less insalary thanfor theequivalentmale.Caution: thisconclusion isbasedsolelyon the limitedevidenceprovidedbythissmalldataset.

Page 92: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 93: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.12Themulticollinearityproblem

Multicollinearityreferstowayinwhichtwoormorevariables ‘explain’ thesameaspectof thedependentvariable.Forexample, let’ssaythatwehad a regressionmodelwhichwasexplainingsomeone’ssalary.Employeestendtogetpaidmoreastheygetolderandalsoastheiryearsofexperienceincreases.Soifwehadbothageand years of experience in the list of independent variables we would almostcertainlysufferfrommulticollinearity.

Multicollinearitycanleadtosomefrustratingandperplexing re-sults.Yourunaregression.Oneofthevariableshasapvaluelargerthan0.05,soyoudecidetotakeitout.Youruntheregressionagainand—-thesignand/or significanceofanothervariables changes. This happens because the two variables were explaining thesameaspectofthedependentvariablejointly.

Howtosolvethisproblem:checkthecorrelationoftheindependentvariablesfirst,beforeputtingthemintoyourmodel.Chapter14hasasectiononcorrelation.Ifyoufindthatthecorrelationofanytwovariablesishigherthan0.7,besuspicious.Thesetwomaybringyoursomegrief!

Page 94: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.13Howtopickthebestmodel

Youwillbetryingoutdifferentformulations,addingandremovingvariablestotrytocaptureasmuchexplanatorypowerasyoucan.Howyoudodecideifonemodelisbetterthananother?Therearetwoapproaches:

• Lookonlyattheadjustedr-squaredvalue,evenif yourmodel containsvariableswith a p value larger than 0.05. If the adjustedr-squaredvaluegoesup,leavesuchvariablesin.

•Pruneyourmodelsothatitcontainsonlyvariableswithapvalue<=0.05.Youstillwantashighanadjustedr-squaredas

Page 95: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

youcanget,butyoualsowantallyourexplanatoryvariablestobestatisticallysignificant.

Itakethelatterapproach.Youusuallyhavefewervariablesbutallof themhaveareason(statisticalsignificance)forbeinginyourmodelandyoucaninterprettheirmeaningintelligently.Aparsimoniousmodelisbetter thanacomplexmodel thatfitsyourdataverywell.Simpleandrobustisgood.Thedatasetthatyouusedtofityourmodelisonlyasamplefromapopulation.Anoverlycomplexmodelmaynotworkwellwhenpresentedwithadifferentsamplefromthepopulation.

Page 96: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.14Thekeypoints

• Think throughyourmodelbeforeyoustart including vari-ables.Whatvariablesdoyouthinkwillhaveaneffectonthedependentvariableandinwhichdirection(plusorminus).Itistemptingtojustputineverythingandhope for the best but this rarely works. Some software is able to dostepwiseregression,pullingoutinsignificantvariablesforyou, butExcelisnotoneofthem.

• Keepyourmodelassimpleaspossible.Complicatedmodels rarelyworkwell.

• Visualiseyourdatafirst,evenwithasimplescatterplotaswehavedonethroughoutthischapter.

•Checkwhetheryouhaveallthevariablesthatyoumightneed.Ifyouweretrying topredictwhetherashopselling expensive jewellerywouldmakesufficientsales,youmightwanttoknowtheaverageincomeofresidents.Ifyoudon’thaveit—getit.Thereisahugeamountofdatalyingaboutwhichyoucanobtain either free or quite cheaply. I’veincludedaverybrieflistofURLsinChapter12.

Page 97: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 98: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

6.15Workedexamples

1.Theestimatedregressionequationforamodelwithtwoindependentvariablesand10observationsisasfollows:

y =29+0 . 59 x 1+4 . 9 x 2

Whataretheinterpretationsofb1andb2inthisestimatedregres-sion equation?

Answer:thedependentvariablechangesbyonaverage0.59whenx1changebyoneunit,holdingx2constant.Similarlyforb2.

Predictthevalueoftheindependentvariablewhenx1is175andx2is290:

y =29+0 . 59(175)+4 . 9(290)=1553 . 25

Page 99: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 100: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.Checkingyour regressionmodelItisn’tdifficulttobuildaregressionmodelasthepreviouschapterhasshown.Butthatispartoftheproblem:everybodydoesit.Butnoteverybodytakesthetroubletochecktheresults.Checkingtheresultsisanimportantstepfortworeasons:first,tomakesureyourcalculationsandpredictionsarecorrect;andsecondlytoshow thirdparties thatyourwork is solid. In this chapterwe’llworkondoing just that. Todecidewhetherthemodelswehavebeenworkingonareanygoodweneedtolookattwoareas:

•Isyourmodelstatisticallysignificant?

•Aretheassumptionsbehindtheleastsquaresmethodmet?Inparticular,linearityandresidualdistribution.

Thefirstareaiseasiertoworkthroughthanthesecond,whichisprobablywhyonedoesn’talwaysseeresidualanalysisdiscussedwhenresultsarepresented.Thisisashamebecausethereisagreatdealtobelearnedfrompickingthroughresiduals.Youranalysiswillbegreatlyimprovedbyacloseattentiontothisarea.

Belowwe’llworkthroughstatisticalsignificanceandthentesttheassumptions.

Page 101: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.1Statisticalsignificance

Theregressiontrendlinedisplaysarateofchangebetween twovariables.Thelineslopesupwardswhenthedependent variableincreasesforaone-unitincreaseinthe

independentvariable(

61

Page 102: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

positive relationship) and vice-versa (negative relationship). What we want toknowis this:did thatslopeupwardsordownwards occur by chance; ifwe tookanothersamplefromthepopulationwouldweachieveasimilarresult?Keypoint:weare usuallyworkingwithasampledrawnfromapopulation.Thesamplein thetruckingexamplewasthefirm’slogbook.

Thenullhypothesisisthatthereisnoslope;withthealternativethatthereisinfactaslope.Thehypotheses(seetheGlossaryforhelponhypotheses)totestwhetherthereisastatisticallysignificantslopeare:

Thenull:

H o : β =0

andthe alternative:

H a : β = 0

where

β

isthesymbolforthecoefficientsofthepopulationparameter of the independentvariableofinterest.Inwords,thehypothesisistestingwhetherbetaiszeroornot.Ifit is zero, then we are ‘flatlining’ and there is no point in continuing with theanalysis.Ifhoweverwecanrejectthenull,thenweacceptthealternativewhichisthatbetaisnotzero.Itmightbenegativeorpositive,thatdoesn’tmatter:whatmattersiswhether it is zero or not. Note that we use a Greek letterwhendiscussingapopulationparameterwhichwillprobablyneverbeaccuratelyknown.Weuse theLatin letter ‘b’ when we have estimated it using a sample drawn from thepopulation.

Page 103: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Exceldoesallthetestingofstatisticalsignificanceforusintwoways.

First,itteststheindividualvariablesandprovidesapvalue.Belowistheresultfromthefirsttruckingregressionwedid.Notethatthepvaluefordistanceis0.004.Itiscustomarytouseacut-offof0.05.Themeaningofthisresultisthat,followingtherejectionrule ,werejectthenullifthepvalueissmallerthan0.05.Because0.004issmallerthan0.05,werejectthenullwe therefore conclude that there is in factastatisticallysignificantslope.Insummary,wetestedthehypothesis that the slopewaszero,andrejectedit.Thereforeweacceptthealternativewhichisthatthereisaslopeofsomesort.

Thetruckingregressionoutput

Thistest tellsusthat thereisastatisticallysignificantslope.Itdoesn’ttellusthesignoftheslope(upordown)oritsmagnitude.Butthefactthatthereisasignificantslopeisimportantinformation.Itisworthcarryingonwiththeanalysisbecausethevariablesactuallymeansomething.Bytheway,ignorethepvaluefortheintercept.Itusuallyhasnosubstantivemeaning.

ThesecondwaythatExcelteststhevalidityofthemodelistotestwhetherthemodelasawholeissignificant.ThetestiscalledtheFtestafterthestatisticianRAFisher.Forthetruckingexample,lookundersignificanceF.Theresultisapvalue,inthiscasethesame

Page 104: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

pvalueforthevariableDISTANCE.Wehaveonlyoneexplanatoryvariableandsothepvaluewillbethesame.Wehopethatthepvalueissmallerthan0.05,whichenablesustoconcludethatatleastoneoftheslopesisnon-zero.

The section above examined the first of the tests, which was for the statisticalsignificanceofthemodel.Nowwemoveontothesecondarea.

Page 105: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.2Thestandarderrorofthemodel

ThestandarderrorinExcelisfoundjustundertheadjustedr-squaredoutput.Itistheestimatedstandarddeviationoftheamountofthedependentvariablewhichisnotexplainablebythemodel.Inotherwords,itisthestandarddeviationoftheresiduals.

Undertheassumptionthatthemodeliscorrect,itisthelowerboundonthestandarddeviationofanyofthemodel’sforecasterrors.Thefigurebelowshowstheoutputofregressionestimatingriskofastroke(multipliedby100)againstbloodpressureandage,withadummyvariableforsmokingornot.

Page 106: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Exceloutputforstrokelikelihood

Thestandarderroris6(roundedup).Foranormally-distributedvariable,95%ofthe observations will be within two standard deviations of the mean. For ourpurposesthatmeansthat2x6=12shouldbeaddedorsubtractedtothepredictedvaluetofindaconfidenceintervalforpredictions.Theestimatedregressionequationfromthestrokemodelaboveis:

y =− 91+1 . 08 Age+0 . 25Pressure+8 . 74 Smokedummy

Animaginarypersonwhosmokes,isaged68andhasa bloodpressureof175willhaveariskof34(allfiguresrounded).Wecanbe95%confidentthatthisestimatewillfallsomewherebetween34-12and34+12or22to46.Theseareratherwideconfidenceintervalsandsowemightwanttoworkatimprovingthemodelbyforexampleincreasingthesamplesizeoraddingmoreexplanatoryvariables.

Page 107: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 108: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.3Testingtheleastsquaresassumptions

Therearetwokeyassumptionsthattheleastsquaresmethodreliesonwhichweneedtocheck.Aftercheckingwe’llworkthroughwaysofretrievingthesituationshouldtheassumptionsbeviolated.Theassumptionsare:

• Therelationshipbetweenthedependentandtheindependentvariablesislinear.Fortunatelythisiseasytocheckandalsotofixifitisn’tsatisfied.

• The residualshaveanon-constantvariance. (You’ll some- timesseeinotherstatisticstextbooksotherstricterrequire-ments,suchastheresidualsbeingnormallydistributedandwithameanofzero.Formostpurposesjustchecking for non-constant variance is enough). What doesnon-constantvariance mean? We’ll work through this by defining a resid- ual andidentifyingwhetherornottheassumptionhasbeenmet.Andfinallywhattodoaboutit.Butfirst,checkingforlinearity.

Checkingfor linearity

Therelationshipbetweenthedependentvariableandtheindepen- dentvariableisassumedtobelinear.Thisisimportantbecausethemodelgivesusarateofchange:thecoefficientshowsthechangeinthedependentvariableforaone-unitchangeintheindependentvariable.Iftherelationshipisnon-linear,thatcoefficientwill notbevalidinsomeportionsoftherangeofthevariables.

Wewanttoseeastraightline(eitherupordown)onascatterplot.Putthedependentvariable on the vertical y axis, and one of the independent variables on thehorizontal x axis. If you include the trendline,youcanobservehowclosely theobservedvaluesmatchthepredicted.

Page 109: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Wecanalsocheckforlinearitybyplottingthepredictedvaluesand theobservedvalues.Theplotbelowdoesthisforriskofstroke:

Predictedagainstobservedrisk

Thisisareasonableresult,indicatingthatthelinearityconditionismet.Mostofthepointsareinastraightline,althoughIwould be concerned about the lower risklevels,especiallyaroundrisk=20.Thereappearstobeconsiderablevarianceatthispoint.

Page 110: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.4Checkingtheresiduals

Aresidualisthedifferencebetweentheobservedvalueofyforanygivenxvalue,andthepredictedvalueofyforthatsamexvalue.Itisthereforeapredictionerror,andisgiventhenotationefortheGreekletter‘epsilon’.Itistheverticaldistancebetweentheactualandpredictedvaluesofthedependentvariableforthesamevalueoftheindependentvariable.Wediscussedthisinthepreviouschapter inconnectionwiththeleastsquaresmethod,wherewewrotetheestimationequation:

y = b 0+b 1 x

Page 111: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Thehatony,thedependentvariable,indicatesthatitisanestimateofthedependentvariable.Weknowthattheestimatecannotbecorrectunlessallthepredictedvaluesandtheobservedvaluesmatchupexactly.Iftheydon’t,thenthereareresiduals.Ifwecalltheerrorsε(epsilon)whichistheGreeklettermatchingourlettere,thenwecanrewritetheregressionequationas:

y=β o+β 1x+ε

Theerrorshavenowbeenabsorbedintoepsilonandsowecanremovethehatfromy.Itisthebehaviorofepsilonthatisofinterest, becausetheleast-squaresmethodrestsontheassumptionthatthe errors absorbed into epsilon have a non-constantvariance. Thismeansthattherenorelationshipbetweenthesizeoftheerrortermandthesizeoftheindependentvariable.Therefore,ifweplottedtheerrorsagainsttheindependentvariables,weshouldseenoclearpattern.Howtodo this isdescribedbelow.

Page 112: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.5Constructingastandardizedresidualsplot

Excelprovideswhatiscallsstandardresidualsaspartofitsregres-sionoutput,andwewillusethese.Notehoweverthatthesearenot‘true’standardizedresiduals,buttheyareprobablycloseenough.MakesureyouchecktheStandardizedResidualsboxwhensettingupyourregression.

Page 113: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Checkthestandardizedresidualsbox

Wewant toplot thestandardresiduals against the predicted value yhat.Wewillcreateanewcolumn to the left of the columnStandard Residuals, and copy thecolumnofpredicted times into that new column. Then create a scatter plot ofpredictedtimeandstandard residuals.Youshould end upwith the image below.Standardizedresidualplot¹

¹https://www.youtube.com/watch?v=1wuyZfB39p4

Page 114: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Standardresiduals

TheYaxisindicatesthenumberofstandarddeviationsthateachresidualisawayfromthemean,plottedagainstthepredictedvaluesonthehorizontalaxis.Noneoftheresidualsaremorethat2standarddeviationsawayfromthemeanofzero,sotheresultsaregenerallysatisfactory,althoughthereisoneobservationinexcessof1.5.Westill have a worrying fan shape which would seem to indicate non-constantvariance:wecanobserveapattern.

Let’sruntheregressionagain,butnowincluding the second vari- able,which isdeliveries.Theresidualsplotisobtainedinthesameway,andhereistheresult.

Page 115: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Standardizedresidualswithtwovariables

Thefanproblemseemstohaveimprovedalittle.Ther-squaredof themodelwithtwovariableswashigher,meaningthattheresidualsweresmaller (because there islessunexplainederror).

Page 116: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.6Correctingwhenanassumptionisviolated

Above,weexaminedpossibleviolationsoftwooftheassumptionsunderlyinglinearregression:linearityandnon-constantvariance.Nowlet’slookatwhattodowhenthese assumptions are found to havebeenviolated.Firstsomegoodnews: linearregressionisquiterobusttosuchviolations,andevensotheyarequiteeasytocorrectfor.We’ll deal with the problems in the same order: linearity and non-constantvariance.

Page 117: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.7Lackoflinearity

The firstcheck is to lookat a scatter plot of the dependent variable against theindependentvariable.Theplotbelowshows volume of sales and length of timeemployed,togetherwithatrendline.

Page 118: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Acurvilinearrelationship

Thereareseveraldifferentwaysinwhichalinearrelationshipcanbeachievedsothat we can use the least squares method. A common method is to include aquadraticterm.Thismeansadding thesquaredvalueoftheindependentvariabletothelistofindependent variables.This iseasilydonebycreatinganothercolumn,consistingofsquaredvaluesofthefirstvariable.

Anewcolumnoftheindependentvariablesquared

Todothis,createa newcolumnand then label it.Clickon the first value in thevariablewhichyouwanttosquare,andaddˆ2.Enter. Then drag downwards. Theplotbelowshowsthepredicted andobservedsales.Themodelisclearlysuperior.

Page 119: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Predictedandobservedafterinclusionofquadraticterm

Ifthedependentvariablehasalargenumberoflowvalues,andisheavilyskewedtotheright,thenagoodsolutionistotransformthedependentvariableintoitsnaturallogarithm.

Page 120: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.8Whatelsecouldpossiblygowrong?

Regressionisaverycommonlyusedanalyticaltoolandyouaremostlikelyeithergoingtouseityourselforexaminetheworkofotherswhohaveusedregression.Below is a list of common mistakes to watch for. And if I have made themsomewhereinthisbook,I’msureyou’llbethefirsttoletmeknow!

Page 121: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.9Linearitycondition

Regressionassumesthattherelationshipbetweenthevariablesis linear: the trendlinethatthesoftwaretriestofindtominimizethesquareddifferencebetweentheobservedandpredictedvaluesisstraight.Soifyoutrytorunaregressiononnon-lineardata,you’llgetaresultbutitwillbemeaningless.

Page 122: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Action :alwaysvisualizethedatabeforedoinganyanalysis.Ifyouseethatthedataisnon-linear,youmaybeabletotransformitusing the techniquesdescribed in thepreviouschapter.Asanexample: thecurvilinearexamplewhichwastransformedbyincludingaquadratictermasanexplanatoryvariable.

Page 123: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.10Correlationandcausation

Regressionisaspecialcaseofcorrelation,andasweallknowcorrelationdoesn’tmeancausation(seetheGlossary).Inregression,nomatterhowgoodthemodel,allthat we have been able to show is that a change in an explanatory variable isassociatedbyachangeinthedependentvariable.Forexample,yourecordthehoursyouputintostudyingandyourgrades.Surprise!Morestudying=bettergrades?Orperhapsnot….couldhavebeenabetter instructor.Wecannotsaythatonecausedtheother.Sowhenwritingupresultsorinterpretingthoseofothers,beverycarefulnottoclaimmorethanyouareableto.

There is a related problem which is ‘reverse causation’. You study more, yourgradesgoup.Temptingtothinkthatonepossiblycausedtheother.Butperhapsitistheotherwayround?Yourgradeswerepoor,youstudiedharder?Oryouhadgoodgradesandthenledyouintotakingthecourseseriously.

Action : try not to include explanatory variables which are affected by thedependentvariable.Youcouldalsotryto‘lag’onevariable.Cutandpastethehoursofstudyingvariablesothatitisonetime-periodbehindthegradesandthenruntheregressionagain.

Page 124: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.11Omittedvariablebias

Under Correlation in the Glossary I give some examples of lurking variables.Regression, like correlation, is susceptible to the same problem.Example: younoticethatinhotterweathertherearemore

Page 125: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

deathsbydrowning.Didthehotterweathercausethedrownings?Wellno,theextraswimmingcausedbytheheatpresentedmoreriskscenarios.Thelurkingvariableishoursspentswimmingratherthantemperature.Trytogettotherealvariableifyoucan.

Page 126: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.12Multicollinearity

Whenpredicting the price of a house, square footage and number of roomsarelikelytobehighlycorrelatedbecausetheyarebothex-plainingthesamething.Thisproblemresultsinunusualbehaviorintheregressionmodel.

Action : correlate your explanatory variables BEFORE doing any regression.Watchoutifyouhaveapairwhichhasanrvalueof

0.7orhigher.Thisdoesn’tmean thatyou shouldn’tuse them, but it’s ared flag.Thisiswhatmighthappen.Youtakeoutononethevariablesina regression, theonethat’sleftreversesitssignorsuddenlybecomesstatisticallyinsignificant.

Ifyoudoruntheregressionandyougetanunlikelyresult,choosethevariablewiththehighesttvalue(orsmallestpvalue)andditchtheother.

Page 127: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

7.13Don’textrapolate

Thecoefficientsfromtheregressionarecalculatedbasedonthedatayouprovided.Ifyoutrytopredictforavaluebeyondtherangeof thatdata, the results will beunreliableifnottotallywrong.

In theyearsofexperienceandsalaryexample inChapter5, theco- efficient foryearsofexperienceprovidedthechangeinsalarythatanextrayearofexperiencewouldgive.Wouldyoufeelcomfortablepredictingthesalaryofsomeonewith95yearsofexperience?

Action :beforerunningthenumbers,checkthattheinputsarewithintherangeforwhichyoucalculatedthemodel.

Page 128: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 129: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

8.TimeSeries IntroductionandSmoothingMethods

Atimeseriesisasetofobservationsonthesamevariablemeasured atconsistentintervalsoftime.Thevariableofinterestmightbemonthlysalesvolume,websitehitsoranyotherdataofrelevance to thebusiness.Using timeseriesanalysiswecan detect patterns and trends and–just possibly–make forecasts. Forecasting istrickybecause (ofcourse!)weonlyhavehistoricaldata togoonand there isnoassurance that the same pattern will repeat itself. Despite all this, time seriesanalysishasdevelopedintoahugetopic, fortunatelywithalargenumberoffreelyavailabledata-setstouse.Linkstosomeofthesedata-setsareprovidedattheendofthedatachapter(Chapter3).

There are two basic approaches: the smoothing approach and the regressionmethod .Bothmethodsattempt to eliminate the background noise. This chaptercoverssmoothingmethods, thefollowingchaptertheregressionapproach.

Page 130: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

8.1Layoutofthechapter

Herewewill

1.identifythefourcomponentsthatmakeupatimeseries

2.introducethenaive,movingaveragemethodsandexponen-tiallyweightedsmoothingmethodstomakeaforecast

3.discusswaysinwhichtheaccuracyoftheforecastscanbecalculated

76

Page 131: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 132: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

8.2TimeSeriesComponents

There are four potential components of a time series. One or more of thecomponents may co-exist. The components are: trend compo- nent; seasonalcomponent;cyclicalcomponent;randomcomponent.

•trendcomponent :theoverallpatternapparentinatimeseries.Theplotbelow shows French military expenditure as a percentage of GDP. Thedownwardtrendisapparent.

FrenchmilitaryexpenditureasapercentageofGDP

• seasonalcomponent :salesofskisorChristmasdecorationsareclearlyhigherinthewintermonths,whileicecreamandsuntancreammovemorequicklyinthesummer.Ifweknow the seasonal component, thenwe canuse this information for prediction. The time between the peaks of aseasonalcomponentisknownastheperiod .Noteherethatseasondoesn’tnecessarilymeantheseasonoftheyear.

TheplotbelowshowssalesofTVsetsbyquarter.HerethereisaconstantupwardtrendbecausemorepeoplearebuyingTV sets,butthereisalsoaseasonaleffect.

Page 133: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

TVSalesbyquarter

•cyclicalcomponent :sometimeserieshaveperiods last-inglongerthanone year.Business cycles such as the50- year KondratieffWave are anexample.Thesearealmost impossibletomodel,but that doesn’t stop thehopefulfromtrying.KondratieffhimselfendedupavictimofJosephStalinbecausehissuggestionthatcapitalisteconomieswentinwavesunderminedStalin’sviewthatcapitalismwasdoomed,andhewasexecutedin1938.

•randomcomponent :therandomcomponentisjustthat:thenoiseorbitsleftoverafterwehaveaccountedforeverythingelse.Thetell-talesignsofarandom component are: a con- stant mean; no systematic pattern ofobservations;constantlevelofvariation.

Page 134: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

8.3Whichmethodtouse?

In the next two chapters, we’ll work through the two basic ap- proaches: thesmoothingapproachandtheregression approach.Howtodecidewhichtouse?

Page 135: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Firststepasusualis tomakeasimplescatterplotofyourdata,with timeon thehorizontalxaxisandthevariableofinterestontheverticalyaxis.Ifthe result issomethingresemblingastraightline, forexampletheFrenchmilitaryexpenditure,thenuseasmoothingapproach.Ifthereisevidenceofsomeseasonality,suchaswiththeTVsalesdata,thenregressionisthewaytogo.Thisisespeciallythecaseifyouwanttocalculatethesizeoftheseasonaleffect.

Page 136: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

8.4Naiveforecastingandmeasuringerror

Forecastingisjustthat:aneducatedguessaboutwhatmighthappenatsomepointinthefuture.Forecastsarebasedonhistoricalevents, andwe are hoping that somepatternofbehaviorwillrepeatitselfwhichwillmaketheforecast‘true’.Thebestthatwecandoisto tryoutdifferentforecastingmethodsandworkout howwelltheypredicteddatawhichwealreadyknewabout.Thenwetakealeapintothedarkandhopethatthebestofthosemethodswilldoagoodjobondatathatwedon’tknowabout.Thissectionconcernsthemeasurementofforecastinga c c u r a c y .

We’regoingtoconstructanaiveforecastandusethatforecastto demonstratetwocommonmethodsofaccuracychecking:theMean SquaredError(MSE)and theMeanAbsoluteDeviation(MAD)approaches.

Anaiveforecastassumesthatthevalueofthevariableatt+1willbethesameasatt.Thevaluesarejustcarriedforwardbyonetimeperiod.Hereisanexample:

Page 137: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

ImageGaspriceswithnaiveforecast

The naive approach is sometimes surprisingly effective, perhaps because of itssimplicity.Ingeneralsimplemodelsperformwellperhapsbecauseoftheirlackofassumptions about the future.Now we will measure the accuracy of the NaiveForecastwithMADandMSE.Inbothmethodstheerroriscalculatedbysubtractingtheforecastorpredictedvaluefromtheobservedvalue.Themethodsdifferinwhatisdonewiththoseerrors.

MADmeasurestheabsolutesizeoftheerrors,sumsthemandthendividesbythenumberofforecasts.Theabsolutevalueoftheerrorisjustitssize,withoutthesign.InExcel, you can find an absolute value with =ABS(F1 - F3) where FI is theobservedvalueandFEthepredictedvalue.Usethelittlecorneroftheformulaboxtodrag it down. Then find the sum and divide by the n, which is the number ofobservations.Inmaththisis

MAD=

∑( abserror )

n

InMSE,theerrorissquaredbeforebeingsummedanddivided.

Page 138: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Squaring the error removes the problem of the negative numbers, but createsanotherone:largeerrorsaregivenmoreweightingbecauseofthesquaring.Thiscandistorttheaccuracyoftheresults.ThemathsfortheMSEisbelow

MSE=

∑( error )2

n

Theplotbelowshowstheerrorsassociatedwiththenaiveforecast-ing,theabsolutevaluesoftheerrorsandtheMAD.InExcel,use

=abs()toconverttoanabsolutevalue.

ImageErrorsandMeanAbsoluteDeviation

Page 139: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

8.5Movingaverages

Slightlymorecomplexthanthenaiveapproachisthemovingaverageapproach,whichreliesonthefactthatthemeanhasless

Page 140: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

variance than individualobservations.By focusingon themean,we can see thegeneraltrend,lessdistractedbynoise.

Theanalystdecideshowfarbackintimetheaveragewillgo.Thelongerbackintime,thenthemorevaluesoftheobservation are averaged, resulting in a flattermore smoothed result. This is good for observing long-term trends, but lesssatisfactoryifyouareinterestedinmorerecenthistory.

The number of observations which are to be averaged is known as k, and thisnumber is chosenby theanalyst basedonexperience. In Excel’s Data AnalysisToolpakthereisaMovingAveragetoolwhichcandoallthework.Positioningtheoutputisslightlytricky.Therearektimeperiodsbeingusedforthecalculationofthe first forecast;sothefirstforecastshouldbeatk+1becausektimeperiods arebeingusedtofindthefirstvalue.Whereexactlytoputthefirstcelloftheoutputisabitfiddly.Iftherearektimeperiodsbeingused, placethefirstcell at k-1.Excelprovidesachartoutput,examplebelow.ThereisaYouTubehere¹

ImageMovingaveragewithk=3forgasprices

Themovingaveragetechniqueisusedbyinvestorstodetectthe‘goldencross’andthe‘deathcross’.Theyexamine50dayand200daymovingaverages.Whenthe50daymovingaverageis above

¹http://youtu.be/zrioQOWfxjY

Page 141: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

the200daymovingaveragethispointistheso-calledgoldencrossandasignaltobuy.Bycontrast,whentheoppositehappensthisis the‘deathcross’…timetogetout!

Page 142: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

8.6Exponentiallyweightedmovingaverages

The exponential smoothingmethod (EWMA) is a further step forward from thenaiveandthemovingaverageapproaches.Themovingaverageforgetsdataolderthanthek timeperiodsspecified, while the EWMA incorporates both the mostrecentandmorehistoricalobservationstoconstructaforecast.IntheGlossaryyou’llfindmoredetailedmathshowinghowtheEWMAbringsforwardallpasthistory.Ingeneralthefurtherbackintimeyouget,thelessinfluencetheobservationshave.

UsingEWMAwechooseasmoothingconstant,alpha,whichsetstheweightgivento themost recent observation againstprevious forecasts. Ifwe thought that theforecast(t+1)wouldbesimilartothecurrentobservation(t=0)andwethoughtthatolderforecastswereof littlevalue, thenwewouldchooseahighvalueofalpha.Alpharunsfrom0to1.TheEWMAformulais

F t +1=αY t+(1−α ) F t

where F is the forecast, andY is the actual value of the variable. alpha is thesmoothingconstant,ranginginvaluebetween0and1,andchosenbytheanalyst.

Inwords, this equationmeans that the forecast value (at t+1) is the smoothingconstantalphamultiplyingtheactualvalueat t=0plus(1-alpha)timestheforecastatt=0.Soifthepreviousforecastwastotallycorrect,theerrorwouldbezeroandtheforecastwouldbeexactlywhatwehavetoday.Itturnsoutthattheexponential

Page 143: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

smoothingforecastforanyperiodisconstructedfromaweightedaverageofallthepreviousactualvaluesofthetimeseries.

Theequationaboveshowsthatthesizeofalpha,thesmoothing constant,controlsthebalancebetweenweightinggiventothemostpreviousobservationandpreviousobservations.Ifthealpha issmall,thentheamountofweightgiventoYat t=0issmallandtheweightgiventopreviousobservationsislargeandvice-versa.Ifalpha=1,thenpreviousobservationsaregivennoweightatall,andweassumethat thefutureisthesameasthepast.Thisachievesthe same result as naive forecastingdiscussedabove.Typically quitesmallvalueofalphaareused,suchas0.1.

##ApplicationofExcel’sexponentialsmoothingtool

An exponential smoothing tool is available in the Analysis ToolPak. We’lluseCanadian Gross Domestic Product per capita as an example. Here is the dataplottedonitsown.Youtube²

CanadianGDPPerCapita

Thereisasteadyupwardstrend,butwithadipin2009duetotheworldfinancialcrisis.Excelasksforadampingfactor.Thisis1-alpha.Ihavedonetheforecaststwice,oncewithanalphaof0.1

²http://youtu.be/w-GmxjrX1qg

Page 144: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

andagainwithanalphaof0.9.Thereforethedampingfactorswere

0.9and0.1.Theresultforalpha=0.9isshownbelow.

Forecastwithalpha=0.9

This is pretty good forecast, with the forecast tracking the actual observationsclosely.

Theplotbelowshowssheepnumbersascountsbyhead inCanada from1961to2006.First,let’splotthisusing0.3asalpha(arbitrarilychosen).Theplotisbelow,withadampingfactorof1-0.3=0.7.

Page 145: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Sheepnumberswithalpha=0.3

Page 146: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 147: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

9.TimeSeriesRegressionMethodsThesmoothingmethodsweworkedinthepreviouschapterarefinewhenthereisnoevidenceofseasonality,forexamplewith the sheepdata.However,smoothinghas only limited use for longer term prediction and analysis. Data that is moreinterestingforusas analystsmaybenon-linear in trend, andalso have seasonalpeaks and troughs. Using regression, we canmeasure the quantitative effect ofseasonalityandtrend,eithertogetherorseparately.Theresultisamodelwhichcanbeusedforprediction.

Belowwewill:

1.useregressiontoquantifyatrendinatimeseries

2.introduceaquadratictermtoaccountfornon-linearityinthetimeseries

3.usedummyvariablestomeasureseasonality

Page 148: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

9.1Quantifyingalineartrendinatimeseriesusingregression

TheplotbelowshowslifeexpectancyatbirthforCanadians.

87

Page 149: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Canadianlifeexpectancy

This is pretty much a straight line as one might expect in adevelopedcountrywithalargepercapitaexpenditureon healthcare.Notethatthisisforbothsexes:forwomenonlywemightexpecttoseeevenbetterfigures.IrantheregressionwithYearastheindependentvariable,explaininglifeexpectancy.Theresultisbelow.

Lifeexpectancyregressionresults

NoticethattheadjustedR-squaredvalueisclosetounity,reflecting

Page 150: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

thestraightlinethatweseeonthegraph.Thecoefficients fortheinterceptandtheindependentvariableprovidethisestimatedregressionequation

y t=58 . 47+0 . 000565∗YearThemeaning is that for everyyear after 1960 that a childwas born, his/her lifeexpectancyincreasedby0.000565years,orabout5hours.

Page 151: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

9.2Measuringseasonality

ThedatasetforTVsales(fromModernBusinessStatistics byAndersonSweeneyandWilliams)containssalesbyquarterforfouryears.There are therefore sixteenobservations.

IcreatedacolumnwhichIcalledindexsothatwecanseesales ina consecutivefashion.ThefirstquarterisSpringandthelastquarter iswinter.Itisapparentthatquarterswhicharedivisiblebyfourarehigherthanothers,andsoforth.Perhapssalesarehigherinwinter?

WecanusethedummyvariablemethodofChapter5todeterminewhetherthisistrueandalsotheextentofthedifference.WewillcreatedummiesforSummer,FallandWinter,leavingSpringasthereferencelevel.Recallthatwehavefourpossiblestatesoftheseasonvariable,andsowewillneedk-1,or4-1=3dummies.Itusuallydoesn’tmatterwhich state you choose as your reference level: just don’t forgetwhichoneyoupicked.Thedataset is below,with a columncalled Index, whichwe’llusetomeasurethetimetrend.

Page 152: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

TVSaleswiththequarterlydummies

Iwanttofindouttwothings:

whethersalesareincreasingovertime.Icandothisbyincludingtheindexasanexplanatoryvariable.Becausetheregressiontool requiresthatalltheindependentvariablebeinoneblock,IhavecopiedandpastedtheIndexcolumntotheright.

Theregressionoutput is

TVSalesregressionout

Let’swalkthrough this output line by line. First notice that the p values for allindependentvariablesaresmallerthan0.05.Thereforeallaresignificant.

ThereferencelevelisSpring,andthereforethereisnocoefficientfor this quarter.Summer has a negative sign in front of its coefficient, meaning that sales forSummerare smaller than those forSpring.Fallispositive,somoreTV sales aresold in the Fall than in the Spring. Not unexpectedly, Winter has the largestcoefficientofall,

Page 153: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

reflectingtheplot.Theindexhasapositivesign,meaningthataverageTVsalesareincreasingwithtime.

Therearethereforetwocomponentsinthisseries:atrendandaseasonalcomponent.Youtube¹

**Makingpredictions**fromtheseresults.Let’spredictthesalesforFallinsevenquarterstime.Thisis

y =4 . 7+1 . 05875+7∗0 . 147=6 . 78

Theobservedvaluewas6.8,sothepredictionwasreasonableifslightly pessimistic.

Page 154: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

9.3Curvilineardata

Thedataabovefollowedalineartrend,whichisperfectforordinary least squares,which assumes that the relationship between the dependent variable and theindependent variable is linear. But some interesting data does not follow a neatlineartrend.Theplotbelowshowscrudeoilandgaspricesovertheperiod1949to2003.

¹http://youtu.be/wyKIHInMbY8

Page 155: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Crudeandgaspricesshowingcrudepricecurvilinearity

However,ifweshowthetwoseriesonthesamegraph,asontheplotIdidinTableau,withseparateyaxesforeachtypeoffuel,wecanseethattheshapesareveryclose.

Crudeandgasondifferentaxes

Thecurvilinearityofcrudepricesisclear.Thepricescametoapeak

Page 156: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

inthe1970sasaresultofOPEC’sdecisiontorestrictsupply. Inany event,crudepricescannot be described as linear. The solution is to add an extra term to theregressionof price on time, and that is time squared. The first few lines of thedatasetarebelow.Ihaveaddedacolumncalledtorepresenttheyear,andaddedtsqwhichisjusttsquared.

Datawithtimesquared

Nowregresscrudeagainsttime and time squared, and the pleasing result belowappears

Page 157: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Thecurvilinearregressionforcrude

The adjusted r-squared is high at 0.88 and the two time predictors are highlysignificantstatistically.Noticethatthetwopredictorsarequitedifferentinsizeandalsohaveoppositesigns.Whentissmall,thenthevariabletdominatesandthetrendis upward. However, as t gets larger, then t-squared gets even larger still. Thenegativesignont-squaredpullstheregressionlinedown.

Page 158: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 159: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.OptimizationLinearProgramming (LP) is the tool we use to optimize a particular objectivefunction .Forexample,amanufacturerofcarpetswantstogetthemostprofitoutof his rawmaterials and labor force. The objective function is the relationshipbetween the inputs and his costs and thus his profits. Optimization helps themanufacturerfindthemostefficientallocationofresources.Theobjectivefunctiondoesn’tnecessarilyhavetobeintermsofmoney;itcouldverywellbetoallocateworkinghoursefficientlysothateveryonehasmore timeoff.Frequentlywewanttomaximize theobjective function(perhapsmakemoreprofit)butwemightalsowanttominimizeanobjectivefunction,suchascosts.

Optimizationisnotparticularlydifficult,andwewillusetheSolveradd-intoExcelto do themathematically tricky parts.What is important is thewriting up of acorrectmodelinthefirstplace.

Layoutofthechapter

Linearprogramming isnotespeciallydifficult,but theworkhas to bedone in alogicalandorderlyfashion.Inthischapter,wewill

*spendsometimeworking throughthebasicsteps in setting up an optimizationproblem *work through themeaning of the Solver output in termsofwhat thecoefficientvaluesactuallymean

Page 160: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.1Howlinearprogrammingworks

Linearprogrammingworksbysolvingasetofsimultaneousequa-tions.Theproblemtobesolved—tomaximizeastockreturnfor

95

Page 161: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

example—iswrittenasasetofsimultaneousequations.Theequa- tions may bequitesimple,buttheremaybemanyofthem.LinearProgrammingfindsthebestsolutiontotheequations.Thejobistofindthecoefficientsforthevariablesintheequationwhichmaximizeorminimizetheoutcome.

We’llworkthroughanexamplefirstonpaper,andthenputit intoExcel’sSolveradd-intogetthesolution.Thethreeimportantstepsare:

•Writethesimultaneousequations,basicallyputtingintomaththewordingoftheproblem.Thisisprobablythetoughestpart.

•RuntheOptimizationinSolver,tofindtheoptimalsolution.

• Sensitivity analysis to find out how much effect a unit change in aconstraintwouldhave.

Page 162: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.2Settingupanoptimizationproblem

Optimizationproblemsareusuallypresentedaswrittenquestions.Thebestapproachistodeterminewhicharethe

•decisionvariablesthosevariableswhichyoucancontrol.Forexample,inthecarpetfactoryexample,thenumberofhoursgiventoeachworkeris adecisionvariable.

Itisusuallythesizeofthedecisionvariablethatwearetryingtooptimize.Fortheworkers,wewouldwantthemtohaveexactlythenumberofhourswhichproducesthe most profitable output. Too few hours and the factory is working belowcapacity.On theother hand, toomanyandmoneyisbeingwasted.The decisionvariablesgointowhatSolvercallsthechangingcells.Theconventionisto color-codethesecellsinExcelinredcolor.

Page 163: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

• constraints are constants or givens which we cannot change. Thejurisdictionwhere thefactory is locatedmighthave legislation restrictingthemaximumnumber of hours that an employee canwork per day.Wewouldneedtoadd aconstrainingequationtotakeaccountofthisfact,evenifthesolutionwasfinanciallysub-optimal.

Page 164: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.3Exampleofmodeldevelopment.

Afarmerhas50acresoflandathis/herdisposal.Healsohasup to 150hours oflabor and up to 200 tonnes of fertilizer.He can plant either cotton or corn or amixture.Cottonproducesaprofitof$400peracre,corn$200peracre.Eachacreofcottonrequires5laborhoursand6tonnesoffertilizer.Forcorntheequivalentis3and2.Howmuchlandshouldthefarmerallocatetoeachcrop?

Let’scallacresofcottonxandacresofcorny.FforfertilizerandLforlabor.Thesefourvariablesarehisdecisionvariablesbecausehecanallocatecropstolandandhecanalsodecidehowmuchlabor and fertilizer to deploy.He is searching for thecombinationofthefourdecisionvariableswhichmaximizeshisprofit.

Weknowhehas50acres,sothetotalacreagecannotexceedthisamount.Thisisaconstraint,whichcanbewrittenasanequation,asshownbelow.Otherconstraintsarethemaximumlaborandfertilizer.Obviouslyhecannotusemorethanhehas.

Page 165: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.4Writingtheconstraintequations

Thetotalacreagedevotedtoeachcropcannotexceed50:

x+y≤50

Thelaborandfertilizerarealsoconstraints.Referbackto thequestiontogettheamountsoflaborandfertilizerperacrepercrop.

Page 166: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Forlabor,theconstraintequationis:

5 x+3 y≤150

Thisequationcomesaboutbecause for each acre of cotton, the farmer needs 5hours of labor,while for corn it is 3.Wedonotknowthevaluesofxandy,butwhatevertheyare,weknowthat 5xplus3ymust be less than or equal to 150,becauseweonlyhave150hoursavailable.

Forfertilizer,theconstraintequationis:

6 x+2 y≤200

Page 167: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.5Writingtheobjectivefunction

Whatdowewanttogetoutofthis:whatistheobjective?Clearly it is the mostefficientmixtureofinputs,subjecttoconstraints,producingthelargestprofit.Theobjectivefunctionis:

MaxProfit=400x+200y

Inwords,findthecombinationofxandythatmaximizestheprofit, subject totheconstraintswehavewritten.ThenexttaskistoputallthisintoSolver.

Page 168: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.6OptimizationinExcel(withtheSolveradd-in)

WewilluseanExcelspreadsheetandSolvertoachieveallthreesteps.

OptimizationYouTubeforthefarmerproblem¹

¹https://www.youtube.com/watch?v=WeTgK6wmvSY

Page 169: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

OpenanExcelspreadsheetsothatyoucanfollowalong.

Cellcoloringconventions

Isuggestyoufollowtheseconventionsforcolor-codingthecells:

•Inputcells.Thesecontainallthenumericdatagiveninthestatementoftheproblem.ColorinputcellsinBLUE.

•Changingcells.Thevaluesinthesecellschangetooptimizetheobjective.CodechangingcellsinRED.

•Objectivecell.Onecellcontainsthevalueoftheobjective.ColortheobjectivecellinGREY.

NowI’llworkthroughthefarmingproblemabovestepbystep.

MakesureyouhavetheSolveradd-inloaded.Tocheck,openExcel, thenclickontheDataTab.IfSolverisloaded,youwillseetheSolvernametotheright.

TheSolvertab

BelowistheExcelspreadsheetwiththeinformationthatweknowalreadytypedin.Ihaveputrandomvaluesinthe red-coloredchangingcells,justasplaceholders.Thesenumberswill changewhenExcelsolvesforthemostprofitableallocation.

•TypetheaddressoftheobjectiveintotheSolverdialoguebox,andmakesuretheMaxradiobuttonisselected.

•Typeintherangeofthechangingcells.

Page 170: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

•Workthroughtheconstraints.Therearethree:labor;fertil-izer;andland.

•SelectSimplexLPastheSolvingmethod.

•Presssolve.

Thefarmingsolution

Solverwillchangethevaluesinthechangingcellstomaximizetheobjectivecell.Ifoundthat30acresofcottonandnoneofcornprovidedaprofitof$12000.

Page 171: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.7Sensitivityanalysis

Constraintscanbeeitherbindingorslack .Ifaconstraintisbinding,thatmeansthatallofthatparticularresourceisbeingused,andmorecouldbeemployedifitshouldbecomeavailable.Wecanfindoutwhetherconstraintsarebindingandalso theirshadow

Page 172: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

pricebypressingSensitivityAnalysisafterrunningSolver again.TheSensitivityanalysiswill appear as a tab at the bottom of your worksheet. For the farmingproblem,itlookslikethis:

Farmingsensitivityreport

Let’sletattheConstraintssection.Laboruses150hoursandhasashadowpriceof$80.Theshadowpricesindicatestheper-unitvalueoftheconstrainedcommodityiftheconstraintwasincreasedbyoneunit.Ifwecouldhaveonemorehouroflabor,thentheprofitwouldincreaseby$80.Landandfertilizerhaveashadowpricesofzerobecausetheconstraint isslackornon-binding.We are not completelyusingtheseresourcesandsowedonotneedanyextrainputs.Therearesacksoffertilizerlyingaroundunused.Noneedtobuymore.

The columnsAllowableIncrease andAllowableDecrease are rele- vantbecausetheytellushowmuchmoreofabindingconstraintwecouldusebeforerunningthemodelagain.Forlabor,theamountis16.67(rounded).IfweeasedtheconstraintbymorethanthisamountthenwewouldneedtorunthenewmodelinSolveragain.Thesamelogicfordecrease.

Page 173: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 174: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.8InfeasibilityandUnboundedness

Solverisquiterobust,buttwoproblemsmayoccur.Asolutionto anoptimizationproblemisfeasibleifitsatisfiesalltheconstraints.Butitispossiblefornofeasiblesolutiontoexist.Thisoccurs ifyoumakeamistake inwriting themodel,or themodelistootightlyconstrained.

Unboundedness

Unboundedness occurs when you have missed out a constraint. There is nomaximum(orminimum).Trychangingallconstraintsto>=insteadof=<.

Page 175: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

10.9Workedexamples

Thedemandingmother

Mostmothersarekeentokeepcontactwiththeirchildren.Oneparticularmotherisratherdemanding.Sherequiresatleast 500minutesperweekofcontactwithyou.Thiscanbethroughtele-phone,visitingorletter-writing(yes!Somepeoplestilldothat!).Herweeklyminimumforphoneis200minutes,visiting40minutesandletters200minutes.Youassignacosttotheseactivities.Forphone:$5perminute;visiting$10perminute; letter-writing$20perminute.Howdoyouallocateyour time sothatyourcostistheminimum?

Writetheequationsfirst.Whatistheobjective?

Letxstandforminutesofphone;yminutesofvisiting;and zminutesofletters.Youwanttominimizethecostoftheseactivities, so theobjective is tominimise200x+40y+200z.Theconstraintsare:x+y+zmustbemorethanorequalto500.

Page 176: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Spreadsheetfordemandingmother

Noticethatwewanttominimizethetime,sochangetheradiobuttontominratherthanmax.Andwhenyouare typing in the constraints,checkthatthesignoftheinequalityistherightwayround.

Thetheatermanager

Youarethemanagerofatheaterwhichisinfinancialtrouble.Youhavetooptimizethecombinationofplaysthatyouwillputontomakethemostprofit.Youhavefiveplaysinyourrepertoire,A,B,C,D,E.Theyhavedifferentdrawpoints(appealtothepublic)andthereforeticketpricestomatch.Adrawpointishowattractive theplayistothegeneralpublic.Thedatalookslikethis:

Play Draw Ticket

A 2 20

B 3 20

C 1 20

D 5 35

Page 177: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Play Draw Ticket

E 7 40

Youhavetheseconstraints:youhaveonly40possibleslots.Andnoplaycanbeputonlessthantwiceormorethantentimes(givestheactorsareasonableturn).Eachperformance costs $10 per ticket sold, regardless of the ticket price (covers theelectricityetc).Thetotaldrawpointshastoexceed160tokeepthecriticshappy.

Howdoyoudistributeyourperformances?Whichcombinationofplaysproducesthehighestprofitandsatisfiestheconstraints?

Wearetryingtomaximizeprofit,solookforanobjective function thatdoesjustthat:20A+20B+20C+35D+40E.Butwait—-wehavethe$10cost.Sobettertakethatoutfirst,leaving10A+10B+10C+25D+30E.

Withthese constraints:

Thedrawpoints(theattractivenessoftheplaytocritics):2A+3B+1C+5D+7E>=160

andnoplaycanbeputonmorethantwiceormorethantentimes.A<=2andA<=10B<=2andB<=10C<=2andC<=10D<=2

andD<=10

andwehaveonly40slots,soA+B+C+D+E<=40

Page 178: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Spreadsheetfortheatreproblem

Moreworked examples

19thcenturyfarmer

Iama19thcenturyEnglishfarmer.I cangrowwheatorbarley.Wheatyields10bushelsanacre,barley8bushelsanacre.Thepriceofwheatistenshillingsabushel,barley5shillingsabushel.The labourcostsforwheatare3 shillingsanacre, forbarley2shillingsanacre.Thetransportationcosttomarketforwheatis1shillingabushel,forbarley½shillingabushel.Ihave100acres.(abushelisameasureofvolume;anacreisameasureofarea;ashillingisacurrencyunit).

Questions:HowdoIsplitupmyland?

IfIcouldgetonemoreacreofland,howmuchwouldthatbeworthtome?

Yetanotherfarmingquestion(Iusethesebecausemostpeoplecanimaginefieldsofland,cropsgrowingandthelike.Butofcoursethesame techniquesareapplicabletootherbusinesssituations).

Page 179: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Afarmerhas100acresofland.Hecanplantcropsorraisesheep. Eachhectareofcropsprovides$100butrequireslaborof$50peracre.Hecanspendamaximumof$500oncroplabor.Eachacreofsheepprovides$40butneedsonly$10 in laborcharges.Thelaborbudgetisunlimited.

Worked solution

CallareaincropsXandin sheepY.ThenX+Y=<100 and50X<= 500Theobjectiveis100X-50X+40Y-10Ysimplifiesto50X+30Y.MyExcelspreadsheetisbelow.

Spreadsheetfortheaboveproblem

Page 180: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 181: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

11.Morecomplexoptimization

Optimization using Solver is a powerful method of solving common businessresource-allocationproblems.Inthepreviouschapter theproblemswererelativelysimpleconcerning theallocationof land ortime.Butlinearprogrammingcanbeusedtosolvemorecomplexandworthwhileproblemsaswe’llseebelow.Thekeyrequirement isthattheanalystisabletodefinetheproblemasasetofequations.Thereisnoonemethodexceptforthinkingcarefullyandwritingouttheproblemasasetofequations.

Todemonstrate,we’llworkthroughthreedifferenttypesofprob-lem:

*problems concerning proportionality: you need to allocate money to differentinvestmentswhileminimizing riskandkeeping returns above a certain amount.Whatproportiondoyouputineachinvestment?

*supplychain problems

*blendingproblemswhereyouneedtomixtogetherinputsfromdifferentsources

Page 182: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

11.1Proportionality

Investmentdecisions example

Thisexampleshowshowwecan‘weight’theinputsaccordingtosomecriteria,inthiscasetheirrisk.Wewanttominimizetheriskbutensurethatthereturnisabovesomeminimumlevel.Whatisthemixtureorblendofinvestmentsthatcandothat?

107

Page 183: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

The problem: you have an inheritance of $300,000 from an uncle but there aresomerestrictions:youmustinvestallthemoneyinfourfunds;yourannualreturnhastobeatleast5%.Andyoumustminimizeyourrisk.Thefourfundsyourunclehasspecifiedare:

Fund Risk Return

X1 10.7 4.2

X2 5 4

X3 6 5.6

X4 6.2 4

Let’sdealwiththeobjective function first.That is tominimize the risk.Wecanweightthesizeof the investment ineach fundwith its risk index.So,weightingeachinvestmentandthendividingbythe totalinvestmentgivestheamountoftherisk:whichisexactlywhatwewanttominimize.SowewanttominimizeR(forRisk)likethis:

R=10 . 7 X 1+5 X 2+6 X 3+6 . 2 X 4

300 ,000

Ifyoudon’tgetthis,thinkaboutwhatwouldhappenifall thefundswereasriskyasX1:thetotalriskwouldincrease.Anotherwaytothinkofthis:ifwedecreasethenumberofsharesinX1andinsteadincreasethenumberofX4,whatwillhappen?Theriskwilldecrease.

Theconstraints:thetotalinvestmenthastoaddupto$300,000,sothisconstraintis

X 1+X 2+X 3+X 4=300 ,000

Wealsohavetoachieveareturnofatleast5%(noeasymatter these days).Thisconstraintis

4 . 2 X 1+4 X 2+5 . 6 X 3+4 X 4 ≥5

Page 184: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Belowismyspreadsheetfromthisproblem:

Investmentblending

Page 185: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

11.2Supplychainproblems

Workingouthowmuchtoshipfromproductioncenterstodemand locations is acommonprobleminsupplychainoptimization.AsIhavebeenstressing,thekeytosolvingproblemsofthistypeistowriteouttheequationswhichdefinethemodel.Hereisanexample:

You runacompanywhichhas bakeries in locationA and B. The bakeries shipcartons of bread to your retail stores at locations X, Y, Z. The bakeries havedifferent capacities and each retail store has different demands. The costs ofdeliverypercartonfromeachbakerytoeachstoreisbelow

Bakery X Y Z Capacity

A 12 13 11 150

B 9 17 17 200

Demand 50 100 90

Page 186: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Notation:adeliveryfrombakeryAtolocationXisAxandsoforth.

Theobjectivefunctionistominimizethedeliverycosts.Sowewant tominimize:12Ax+13Ay+11Az+9Bx+17By+17Bz

Eachbakeryhasafixedcapacity,soAx+Ay+Az<=150andBx+By+Bz<=200

Supplyhastoexactlymatchdemand

Ax+Bx=50Ay+By=100Az+Bz=90

Noticethestrictequalitysign.Myresultsarebelow:

Distributionproblem

Page 187: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

11.3Blendingproblems

Abovewediscussedlinearprogrammingmodelswhichwere simplebuteffective.There are other types of Optimization model which are helpful, especiallyBlendingModels,whichcanalsobesolvedbylinearprogramming.

Blendingmodelsareusedinsituationswherewehavetwoormoreinputswhichhave tobemixed to some formula. Through Optimizationwecanfind themostprofitablemixture.Wine,metals,oil,sausages,recycledpaper—-thisisapowerfultechnique. Youcouldprobablyuseitformarketingcampaigns.We’llworkthoughanexamplewhichisforoil.

Page 188: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Theoilblendingproblem

Theproblem:anoilcompanyhas15000barrelsofCrudeoil1and20000barrelsofCrudeoil2onhand.Thecompanysellsgasolineandheatingoil.TheseproductsaremadebyblendingtogetherCrudeoil1andCrudeoil2.EachbarrelofCrudeoil1hasaqualitylevelof10,andeachbarrelofCrudeoil2hasaqualitylevelof5.Thegasolinethatweproducemusthaveaqualitylevelofatleast8.Theheatingoilmusthaveaqualitylevelofatleast6.Gasolinesellsfor$75abarrel,heatingoilfor$60.Howcanweblend theoils together in such a way that meets minimum qualityrequirementsandmaximizesprofit?

Oilblendingsolution

First,let’sthinkthroughwhatthedecisionvariables(whatgoesintothechangingcells)mightbe.Youmight very well think (as I did first off) that the decisionvariableswouldbetheamountsofthetwooilsusedandtheamountsproduced.Butthisisn’tenough:wehavetoblendtogetherthetwotypesofoil.Theyhavetobemixedbeforetheycanbesold,andthemixturehastoreachsomeminimumqualitystandard.Thecompanyneedsablendingplan.

Theinputs :

selling prices (here gasoline = $75, heating oil = $60) availability of oil fromsuppliersqualitylevelofcrudeoils:Crude1is10andCrude2is5

Theconstraints

Gasolinequality>=8Heatingoilquality>=6QuantityofCrude1

=15000QuantityofCrude2=20000

Theblendingplan :

Gasolinehastohaveaminimumqualityofatleast8,andheating oilmusthaveaminimumqualityofatleast6.TheCrudeoil1we

Page 189: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

haveonhandhasaqualitylevelof10whileCrudeoil2hasaqualitylevelof5.Wewanttoblendthesetwocrudeoilstobothachievetheminimumqualitystandardsandmakethegreatestprofit.

Let’sattacktheproblembycreatingtotal‘qualitypoints’(QPs)whichrepresentthequalityofoilinabarrelmultipliedbythenumberofbarrelsofthatoil.

Writeequationstocalculatethequalitypoints:

TotalQPsinthegasoline=10*amountofOil1+5*amountofOil2

Ifforexample,wemixedtogether50barrelsofCrudeoil1and40barrelsofCrudeoil2,thetotalQPswouldbe50x10+40x5=700.Theaverageperbarrelwouldbe700/90=7.78.Thisistoolowforgasoline(needs8)butacceptableforheatingoil.Twopoints:

we could sell the oil as heating oil, but it is exceeding the minimum qualityrequirement.Wecouldmakemoreprofitbyreducing thequalitytotheminimum,orchargeapremium.Butthisisprescrip-tivework,andwehavetoworkwithintheinputsgiventous).

TheonlywaytogettheoiluptogasolinestandardsistoincreasetheamountofCrudeoil1inthemixture.

Ifyoudon’tgetthis,tryathoughtexperiment:forgasoline,iftherewasnooilatallfromOil2,whatwouldbetheQP?Itwouldbe10*thequantityofoilfromOil1.Again,howaboutifweblended together1000barrelsfromeachtypeofoil: howmanyQPswouldbeproduced?Itwouldbe10*1000+5*1000=15000.Readthisthroughagain…itisimportant.

The blending plan provides us with the constraints we need to ensure that theminimumqualitylevelsareachieved.Intheexample justabove,weblended2000barrelstoprovideaQPof15000.Thisisanaverageof7.5.Goodenoughforheatingoil,notgoodenoughforgasoline.

Page 190: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Oilproblemsolution

Discussionofthesolution:notethatgasolinesellsformoremoneythanheatingoil,buttheoptimalsolutionsuggeststhatweshouldsellmoreheatingoilthangasoline.Thisisbecauseoftheconstraintsonquality.

Page 191: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 192: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

12.Predictingitemsyoucancountonebyone

Why you need to know this . Many business decisions involve counts: eitherbinary(yes/no)orwithintimeorspace.Itwouldbegoodtoknowtheprobabilityofacertainnumberofcustomerscom-inguptoaservicedeskinacertainlengthoftime;ortheprobabilityofacertainnumberofcaraccidentsatagivenintersection.Wearelooking for a discrete probability distribution; discrete because the number ofoccurrencesisaninteger.

Chapter4onregressionshowedhowtopredictthesizeof anoutcomewhichwascontinuous(moneyortimeperhaps).Thedependentvariable—whatweweretryingtopredict—couldtakeonalmostanyvalue.Nowwewanttopredictprobabilitiesfor the occurrence of an independent variable which is an integer. Below we’llwork through some hands-on applications, with the theory available in theGlossary.

We’llbreakthisintotwoparts:

Theprobabilityofa binaryoutcome .Theprobabilitiesoftwofaultyitemsoutofthe next twenty on the production line. Or, the probability of at least fivecustomersoutof thenext fiftyactuallybuyingsomething.Thisisestimatedwiththebinomialdistribution .

The probability of a particular count of occurrences over time or area. TheprobabilityofthreeorfewerpeoplearrivingatyourCustomerServiceDeskwithinthenexthalfhour.Ormorethan twomistakesinthenexttenlinesofcode.ThisisestimatedwiththePoissondistribution .

114

Page 193: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 194: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

12.1Predictingwiththebinomialdistribution

Afterthenormaldistribution(seetheGlossaryforadefinition), thebinomialisthemostimportantinstatistics.ThemathforthebinomialdistributionisalsodefinedintheGlossary.

The binomial distribution provides the probability of a ‘success’ in a certainnumber of ‘trials’. For example, you can calculate theprobabilityofmore thansevenoutofthenexttwentypeople through thedooractuallybuying something.Herea‘success’issomebodybuyingsomething;whilethenumberof‘trials’isthenumberofpeoplecomingthroughthedoor(hereitistwenty).

Somedefinitions:

Whatwearelookingforistheprobabilityofapre-definednumberof‘successes’.Notethatthedefinitionofsuccessisuptotheanalyst. Itcouldbe ‘spendingmorethan$20’orwearingahat.

Intheexamplewe’llworkthroughbelowyourunastore.Youknowthatthelong-runprobability of somebodybuying something is 0.6. You obtained this number bycountingtotalnumbersofcustomersand,outofthose, thenumberswho actuallyboughtsomething(successes).

Workedexample:Youwanttoknowtheprobabilityofexactlythreepeoplethroughthedooroutofthenextfivebuyingsomething.Note that‘success’justmeansthatthedefinedeventhappens.Whetherornotitisa‘goodthing’isn’trelevant.

ForExcel,theargumentsrequiredare: therandomvariablewhose probabilitywewant to predict; number of trials, the long-run probability, and a true/falsestatement.Intheexample, thenumber of trials is five.The randomvariable (X)whoseprobabilitywewanttopredictis3.Wealsoneedthelong-runprobabilityofasuccess.Intheexample,p=0.6.Wealsowanttheprobabilityofexactly3,sousethefalsestatement.I’llgointothisinsomedetailshortly.

Page 195: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Open an Excel spreadsheet, and type in =BINOM.DIST(3,5,0.6,false) and youshouldget this:0.3456.This is theprobabilityof exactly threepeopleoutof thenextfivebuyingsomething(numberofsuc-cessesoutofthenextfivetrials).NoticetheorderoftheargumentsintheExcelfunction.Andespeciallythelastone,whichintheexampleaboveisfalse.(Thealternativeistrue).Thedifferenceisimportantbecausetheresultsarequitedifferent.

Trueandfalseargument .Definingtheargumentasfalseprovidestheprobabilityof exactly the random variable. Defining the argu- ment as true provides acumulativeprobability.

Excel adds up probabilities from the left, so changing the argu- ments to=BINOM.DIST(3,5,0.6,TRUE)=0.66304isthesumoftheprobabilitiesofX=0+X=1andsoonuptoanincludingX=3.Theprobabilityof0.66304isthereforetheprobabilityofthreeorfewercustomersbuyingsomething.ThetablebelowshowstheprobabilitiesofvariousvaluesofX,both‘false’and‘cumulative’.Asyoucansee, thecumulative is just thecontinuedaddition of eachsuccessiveprobability.Thereisafurtherexamplehere¹

Probabilitiescalculatedbothfalseandtrue

Howaboutmorethanthreecustomersbuyingsomething?Inmathnotation,we’relookingforP(X>=3).Weknowthat probabilitiesmustsumupto1.Weknowthatthreeorfeweris0.66304.Somorethanfivehastobe:1-0.66304=0.33696.

¹https://www.youtube.com/watch?v=oZ1DmNQ8wW4

Page 196: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Aslightlyharderexample:theprobabilitythatatleastfourcus- tomersoutofthenexttenwillbuysomething?We’relookingfor P(X >=4). Look carefully at thenotation.X>=4impliesthatthedistributionofthetencustomersissplitintotwohalves:lessthanfourandfourormore.Itisthelatterhalfwhoseprobabilitywe’relookingfor.

Ifwecanfindtheprobabilityof0+1+2+3,thatistheprobabilityoflessthanfour(weonlywant integers here). So find that probability and then subtract from 1,makinguseofthefactthatprobabilitiesmustsumupto1.

Let’sdothisstepbystep.

First,findP(X=<3),thatistheprobabilitythatXisthreeorless:

=BINOM.DIST (3,10,0.6,true) = 0.054762.Notice the ‘true’ which givesusthecumulativeprobability.

Wewantfourormore,sosubtractfrom1likethis:1-0.054762=0.945238.

Anotherexample.It’swinterandyouneedtowearasweatereveryday.Youhavetwoblueandthreeredsweaters.Calculatetheprobabilitythatduringtheweekyouwillweararedsweater:

Exactly twiceintheweek

Answer:thelong-runprobabilityofpickingaredsweateris3/5

=0.6becauseyouhavefivesweaters,andthreeofthemarered.Thenumberoftrialsis7becausethereare7daysinaweek.Thewordingof thequestioncontains theword ‘exactly’ whichmeans thatwe don’t want a cumulative answer, so we’llinclude the FALSE argument. Therefore the answer is =binom.dist(2, 7, 0.6,FALSE)=0.077414

Morethanthreetimes.Whenyouseewordssuchas‘morethan’ that’saclue thatyou’relookingforacumulative probability.In math notation, we’re looking forP(X>3). So if we find the cumulative probability up to and including two, andthensub-tractfromone,we’redone.Thecumulativeprobabilityoftwo

Page 197: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

or fewer is P(X=0) + P(X=1) + P(X=2). So the answer is = 1 -binom.dist(2,7,0.6,true)=0.903744

Thekeys:

Writeoutwhatyouaretryingtopredictinmathnotation.Thisforcesyoutobeclear.Drawalittlesketch(hopefullybetterthanmine!)ifyougetconfused.

Ifthewordingofyourproblemcontains‘exactly’orrequirestheprobabilityofjustoneparticularoutcome(egP(x=3))thenyouwanttousefalse.

Page 198: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

12.2PredictingwiththePoissondistribution

Thebinomialdistributiongaveustheprobabilityofabinaryoutcome(yes/no)outof a certain number of trials. The Poisson gives us the probability of a certainnumberwithinaspecifiedtime-frameorarea.

Here’stheexample.You run theCustomerServiceDesk.Youwant to know theprobabilityoffiveorfewercustomersarrivinginthenexthalfhour.Thatwouldbeusefulforstaffing,wouldn’tit?Youknowthattheonaverage,20customersarriveeveryhalfhour.Youknowthatbecausehavecountedthem.

Wewant:P(X>=5).Likethebinomialabove,thePoissonhasatrue/falseargumentforwhetherwewantcumulativeprobabilitiesor‘exact’.Thenotation include a>sign, which implies cumulative, so we use TRUE. In Excel:=POISSON.DIST(5,20,true).Theanswer is7.19088E-05.This looks a bitweird,butitisjustmathnotation.TheE-05meansthatyoushouldmovethedecimalpoint5placestotheleft.Sotherearefourzeroesinfrontoftheleading7.Inotherwords,anextremelysmallprobability.Perhapssendsome staff out for lunch? It is verysmallbecause5isalongwayfromyourlongrunaverageof20.

Page 199: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 200: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

13.Choiceunderuncertainty

Virtually all business decisions involve at least some degree of uncertainty. Forexample,youdon’tknow(andpresumably canneverknow)somefuture‘stateofnature’whichmightbeanexchangerateorbusinessrental.Afarmerdoesnotknowtheprice of wheat at harvest time, but has to decide howmuch to plant manymonths ahead. His decision is relatively simple compared to a decision-makerconfrontedwithanopponentwhowill reacttotakeadvantageofwhateverdecisionyoudomake.Thefirstsituation iscalled‘non-strategic’and iseasilyhandledbytheexpectedmonetaryvaluemethodwhichwe’llworkthroughinthechapter.Thesecond‘strategic’situationismorecomplicatedbutcanbesolvedbyanapplicationofgametheory.We’llcovertheEMVmethodinsomedetailbelow.Gametheoryisanenormoussubjectandwedon’tcoverithere,butI’llgiveanexampleattheendofthechapteralongwithsomerecommendedreading.

Page 201: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

13.1Influencediagrams

Aninfluencediagramisasketchoftheinfluencesandoutcomesofadecision.Theinfluencediagramallowsustothinkthrough anddepict the forces that influenceourchoiceswithouthavingtoassignprobabilities.Itiscustomarytouseovalshapesforvariableswhichareuncertain,aboxforadecisionnode;andlozengesfortheoutcome.

119

Page 202: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

VacationInfluencediagram

Youareanoutdoors-typeandsothechoiceofactivityisaffectedbytheweather.Theweatherforecastisaffectedbytheweathercondition (onehopes!).Butyoudon’tknowtheactualweathercondition,justtheforecast(whichiswhyitisaforecast).Youchooseyouractivity(box)basedontheforecast.Theamountofsatisfactionyouget (lozenge) depends on the activity you chose AND the actual weathercondition:observethetwolines leadingtothelozenge.Thereisastandardformat:

1.DecisionNodesarepresentedinrectangles.Intheexampleontheslides,thedecisioniswhatactivitytopursuewhenonvacation(climbamountain?readabook?).Whichchoicewemakemaytosomeextentbedeterminedbytheweatherforecast.

2. UncertaintyNodes are presented in ovals. The weather fore- cast isuncertainandsoistheactualweatheritself.

3.OutcomeNodesarepresentedinlozenges.Theoutcomeinthevacationexampleishowmuchsatisfactionyoutookfrom thechoiceyoumadeforvacationactivity.Thisisautility.

Page 203: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

4. Arcs display the flow of the influence. The actualweather conditionaffectstheweatherforecast,whichaffects yourplanning.Noticethatyouare takingadecisiononactivity in advance of knowing what the actualweatherwillbe.Thatiswhythereisnoarcbetweenweatherconditionandvacationactivity.

YoucandrawinfluencediagramsquiteeasilyinExcelusingtheshapesdrop-downtool.

Therearedifferentoutcomeshere,whichwecanbe represented as payoffs.Thehighestpayoffcomesfrommatchingyourchoiceofactivitytotheweatherforecastandthenfindingthattheforecastmatched reality.

Page 204: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

13.2Expectedmonetaryvalue

theexpectedmonetaryvalue,orEMV,isjustthevalueofsomeout-comegivenaparticularstateofnature,multipliedbytheprobabilityofthatstateofnature.Here,Iam using the customary expression ‘state of nature’ not necessarily to refer toanythinginthenaturalworld,buttoaparticularsetofconditions.

Thestateofnaturehastobemutuallyexclusivefortheprobabilities towork.Forexample, if we have separate probabilities for rain and sunshine, then cannotcalculateforrainANDsunshine.Thepayoffswhichresultfromeachstateofnaturearedifferent,dependingonthestateofnature.

Let’smotivatethiswithanexampleusingafriendlyfarmerasthedecision-maker.Hehasthechoiceofplantingwheat,raisingcattle,orsomemixtureofboth.Thosethreepotentialdecisionsrepresenthisrangeofchoices.Fromexperience,heknowshowmuchhecanexpecttoreceivedependingontheweatheratthetimeofharvest.Thatamountiscalledthepayoff .

Page 205: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Hefacesthreestatesofnature,representingdifferentweatherconditionsatharvest.Theseare rain,clear,andsunny.Foreach choiceand foreachstateof nature heknowsthepayoffswhicharerepresentedinthecellsbelow.

Thepayofftableforthefarmer

Therearethreedecisionsandthreeoutcomes,makingatotalof9possiblepayoffs.Payoffscanbenegative,andarenotalwaysintermsofmoney.Theycouldbetime,oranyother appropriate metric. In the rowswe put the choices available to thedecision-maker.Inthecolumnsweputthe‘statesofnature’whicharethe futureeventsnotunderthecontrolofthedecision-maker.

The farmer calculates the probability of each of the three outcomes by workingthrougholdweatherrecords.Hefinds:

probabilityofrainingis0.4probabilityofclearis0.3probabilityofsunnyis0.3(ofcoursetheseprobabilitiesmustalladdupto1.Therecouldbemanymore‘statesofnature’butIhavekeptthemtothreeforclarity.

TheExpectedValueisthesumofeachprobabilitymultipliedby theoutcome. Inmathnotationthisis:

EMV=∑p x∗x

which inwords is: the expectedmonetary value is the sumof all the outcomesmultipliedbytheirprobability.InExcel,usetheformula

=sumproduct(array1, array2). This is a mean, or average, with each outcomeweightedbyitsprobability.Indecisionsinvolvingmoney,

Page 206: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

weusuallycalltheExpectedValuetheExpectedMonetary Value(EMV).Forthefarmer,theEMVsareasbelow:

TheEMVsforthefarmer

Imultiplied(horizontallybydecision)thepayoffbytheprobability ofthatpayoffandthensummed.Usuallywechoosethedecision with thehighestEMV, so thefarmershouldchoosewheat.ExpectedvalueYoutube¹

Note that a ‘good’ decision is making the best decision at the time with theinformationtohandatthattime.Theremaybeunluckyconsequencesbutprovidedtheanalysthasdoneathoroughjobinselecting theoptimaloutcomeat the time,thensheshouldnotbeblamed!

Itwillbehighlyunusualforanoutcomeof30toactuallyoccur.TheEMVisjustaweightedaverageandnotaprediction.

Page 207: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

13.3Valueofperfectinformation

The farmermight be tempted to ask for advice from a consultant. What is themaximumheorsheshouldpayfortheadvice?Itisthedifferencebetweenthebestcase that the farmer calculates and how much he would receive with perfectinformation.WecalculatethevalueofperfectinformationbycomparingtheEMVofthebestchoiceineachstateofnaturewiththeEMVwithoutinformation.Ifyouknewthattheweatherwouldbepoor,youwouldselectCattle;clearwheat;sunnywheatagain.ThenfindtheEMVwithperfectinformationbymultiplyingthesebestoptionsbytheir probabilities.

¹https://www.youtube.com/watch?v=fR7cBts1C1w

Page 208: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Theworkingis:100.4+300.3+70*0.3=4+9+21=34.Thebestwecouldcomeupwithoutperfectinformationwas30,sothevalueofperfectinformationis34-30=4.Soifsomeoneofferedyouperfectinformationforaprice,thehighestthatyouwouldpayfortheperfectinformationwouldbenotmorethan4.

Page 209: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

13.4Risk-returnRatio

There are other ways of choosing the optimal outcome other than the EMV,althoughtheEMVremainsthemostcommonlyused.OnecommonmeasureistheReturntoRiskRatio ,whichprovidesthedollarsreturnedperdollarputatrisk.Foreachdecision,wedivide theExpectedValueof thedecisionbytheStandardDeviationoftheoutcomeforthatdecision.WeusuallychoosethedecisionwiththehighestRRR,because then thedollar returnforeachdollarputat riskishighest.Thefarmer’sworksheetisbelow.

Mixedhasazerobecausethepayoffsareallthesame.Thereisnorisk.Cattlehasthehighestat0.715.This ishigher than theRRR for wheat althoughwheat has thehighestEMV.Thefarmermightwanttothinkmoredeeplyabouttheprobabilityofsunnyweatheratharvesttime,becausethismakesahugedifference.

Page 210: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

13.5Minimaxandmaximin

A non-probabilistic approach to making choices is the use of maximin andmaximax.Youcan see thatyour choicedependson whether you are pessimistic(maximin)oroptimistic(maximax).Ifwecansomehowobtainprobabilities,thenaprobabilistic approachworksbest.

Page 211: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 212: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

13.6Workedexamples

Youare planning to market a new coffee-flavored drink. You have a choicebetweenpackingthedrinkinareturnableornon- returnablepackaging.Yourlocalgovernmentisdebatingwhethernon-returnablebottles shouldbeprohibited.Thetablebelowshowsyourprofits.Ifthenon-returnablelawispassed,youwillstillgetsomesalesfromexports.

Decision Lawpassed Lawnotpassed

Returnable 80 40

Nonreturnable 25 60

Example1

1. What would be the best decision based on maximin and maximaxcriteria?

2. Alobbyisttellsyouthattheprobabilityofthegovernmentbanningnon-returnablesis0.7.Assumingthelobbyistisright,whatisthebestdecisionbasedontheEMV?

3.Atwhatlevelofprobabilitywouldyourdecisionchange?

4.Howmuchwouldyoupayforperfectinformation?

Answers:

1. Maximin: theworstare25and40.Thebest is40,meaningpackagewithreturnables.Maximax:thebestare80and60.Againreturnables.

1. TheEMVforreturnableis800.7+400.3=68.Fornonreturn-ableitis250.7+600.3=35.5.UnderEMV,youshouldusereturnablebottles.

2. Setthisupasequation,withptheunknownwhichhastobalancebothsides80p+40(1-p)=25p+60(1-p).Solvefor

Page 213: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

pandget0.267.Soifyouhearrumoursthatsuggesttheprobabilityiscomingdown,thinkagain?

3. Ifwehadperfectinformation,theEVPIis800.7+600.3=74.Soyoushouldnotpaymorethan74-68=6

Example2

Yourunabank.Acustomerwantstoborrow$15,000for1year at 10% interest.Youbelievethatthereisa5%chancethatthecustomerwilldefaultontheloan,inwhichcaseyouwillloseallthemoney.Ifyoudon’tlendthemoney,youwillinsteadplacethe$15,000inbondswhichreturn6%butarerisk-free.

1.WhataretheEMVsofloanandnotloan?

2. Youhave a credit investigation department which can help you withmoreaccuracyintheprobabilityofdefault.Whatisthemostyoushouldbewillingtopayfortheiradvice?

3. Calculatethelevelofprobabilityofdefaultatwhichlending themoneyandinvestinginbondshaveequalEMV.

Page 214: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 215: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

14.Accountingforrisk-preferences

What this chapter is about: the previous chapter showed howwe could pick themostattractivechoicegivenpayoffsandprobabil- ities.But itmighthavestruckyouthat therewaslittlediscussionoftherisksinvolvedandhowdifferentpeoplerelate to risk. In this chapter we are going to work through picking optimaldecisions,takingintoaccounttheriskpreferencesofthedecisionmakers.

Hereisaquestionforyou:youaregivenalotteryticketwhichhasa0.5probabilityofwinning$10,000anda0.5probabilityofzero.TheEMVistherefore10,0000.5+ 00.5 = $5000. Someone comes along and offers you $3000 for the ticketguaranteed. Would take the sure thing? Or would you hope that you win the$10,000?Most peopleare‘riskaverse’andwouldprefer togiveupsomeof theEMVinexchangeforcertainty.The$3000(orhowevermuchitis)is calledyourCertaintyEquivalent (CE) in thisparticulargamble.The difference between theEMVandtheCEiscalledtheRiskPremium.So

RP=EMV-CE

Youcanthinkofthecertaintyequivalentasasellingprice.Itisusuallysmallwhenthesizeofthegambleissmall,butincreasesas thegamblegetslarger.HereIamusing the term ‘gamble’ because there are probabilities involved. The certaintyequivalentisausefulconceptwhich,amongstotherqualities,allowsustocategorizeapproachestorisk.

•Risk-averse .IfyourCEislessthantheEMVofagamble,thenyouarerisk-averse (and most people are, so nothing to worry about. You’re‘normal’)

127

Page 216: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

•Risk-neutral .YourCEmatchestheEMV.Youplaytheaver-ages.

• Risk-seeking .Youenjoythegamble inherent in theEMVand need tohaveaveryhighCEbeforeyou’llgiveitup.Mostpeopleinbusinesswhoare risk-seeking usually don’t last verylong.Althoughwesometimesseerisk-seekingdecisionswhenyourbusinessisintroubleandahugegambleistheonlypossiblesolution

Businessdecisionsareallabout risk,and theprobabilities them-selvesareoftenunreliableandhardtofind.Thefarmerwhosecrop- ping decision we studied inChapter2doesn’tknow tomorrow’sweather,letalonetheweatherinsixmonths.WegenerallyprefertogiveupsomeoftheEMVinreturnformorecertaintybecauseweareriskaverse.That’swhywebuyinsurance.IknowthatIwillbecoveredifIsmashmycar,andIcanevenputupwiththeknowledge thatpeopleinZuricharegettingricherbecauseofmyaversiontorisk.

Page 217: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

14.1Outlineofthechapter

Here’swhatwe’regoingtodo:

•Describeutilityandcalculateutilities

•Showhowtousetheexponentialutilityfunctiontocalculateutilitiesbasedonrisktolerance

• Convertthoseutilities to certainty equivalentswhichwe can rank andcomparewiththeEMVsofthesamedecision

Torank choices in terms of their certainty equivalent ratherthan their EMV,weneedtheconceptofutility,whichisa numericalmeasureofhowwellagoodorservice satisfiesaneedorwant.Do you prefer to be reading this book, outsideeatinganice-cream,or

Page 218: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

talkingwithafriend?Youcan rank these choices quickly in your head.Youcansubconsciouslyallocateutilitynumberstothechoicesandthenpickwhicheverhasthehighestutility.

Utility numbers help us to rank our preferences, but there are no units and therankingisindividual-specific.Inotherwords,givenasetofalternatives,differentpeoplemaywell havedifferentrankingsfor them.For themoment, let’sassumethatitisjustyouwhoseutilitiesweareinterestedin.Ifwecansomehowfigureoutyour utilitiesfor thedifferentoutcomesofabusinessdecision,wecanuse thoseutilitiestohelpyoumakeamorepsychologicallysatisfyingchoice.

Utility

Theplotbelowshowsahypotheticalutilitycurve.Theutilityvalueisontheverticalyaxisandtheoutcomeofagambleison thehorizontalxaxis.Noticetwothings:

Atypicalutilitycurve

*Theslopeofthelineisupwards.Thisreflectsthefactthateveryone prefersmoremoneytoless

Page 219: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

• The rate of increase of the line is decreasing or slowing down. Itwassteepearlyon,buttowardstheenditisalmostaplateau.Thisisbecausethevalueofeachextra dollar isslightlylessthanthatof the previousdollar.Thisisthemarginalutilityofmoney.

Page 220: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

14.2Wheredotheutilitynumberscomefrom?

Therearetwoapproaches.

Thefirstinvolvesaskingthedecision-makerhis/herchoicebetweentwogambles.

The seconduses a utility function, amathematical function which converts twoinputvariables(risktoleranceandthesizeoftheoutcome)intooneutilitynumber.

WithautilitynumberinhandwecanbegintheprocessofrankingthedecisionsintermsofutilityratherthanEMV.

Theintuitivemodel

We’llworkthroughtheintuitivemodelfirstandthenapplythesamethinkingtotheequationmodel.Onceyouget thehangof it, theequationmodel is quicker andprobably more accurate. We’ll use the payoff table from the farmer example,repeatedbelow.

Thepayofftableforthefarmer

We’llassignautilitynumberof1tothehighestpayoffandthenzerotothelowest.Thebeginningsoftheutilitytablelookslikethis:

Page 221: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

PayoffUtility

701

40

30

20

10

-10

-300

Nowweneedtofillinthegaps.Whatwe’lldoistouseeachpayoff asacertaintyequivalent,andaskthisquestion:

Whatvalueofpwouldmakeyouindifferentbetweeneitherreceiv-ingaguaranteedpayoffofthecertaintyequivalent(inthefirstcase

40)oracceptingagambleofreceivingthehighestpayoff(70)withprobabilityporlosing30(thelowestpayoff)withprobability1-p.Let’ssaythatyouanswerp=0.9.Theexpectedvalueofthegambleis70*0.9+0.1(-30)=60.

This is higher than your certainty equivalent of 40, indicating that you are riskaverse.Thedifferencebetweentheexpectedvalueandthecertaintyequivalentistheriskpremium,andistheamounttheindividualiswillingtogiveuptoforgorisk.

Continuedownwards,certaintyequivalentbycertaintyequivalentandcompletethetable.I’vemadeupthenumbersjustforillustra-tion.Aplotoftheutilityvaluesisalsoshown.

Payoff Utility

70 1

40 0.9

30 0.8

20 0.65

10 0.4

-10 0.35

-30 0

Page 222: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 223: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Farmer’sutilitycurve

Ifyouwererisk-neutral,youwouldprefertotakethegambleandsowouldbeanEMVmaximizer.Thechoiceoftherisk-neutralperson isindicatedbythestraightline.

Now,let’sreplacethepayoffvaluesinthefarmertablewithutilities.WecancalculatetheexpectedutilitiesinthesamewayaswefoundtheEMVs.Thetableshowsbothforcomparison.Wemultiplytheutilityofeachoutcomewithitsprobability.

Expectedutilities

The famer’s EMV decision would have been wheat (EMV=14)but taking intoaccount his risk preferences, mixed gives the highest expected utility. It makessensestospreadtheriskbetween twocompletelydifferent crops.

##Theexponentialutilitycurvemethod

It is rather tediousaskingsomanyquestions,pluspeopleget tired andconfusedratherquickly.Anattractivealternativeistousea

Page 224: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

mathematicalfunction, theexponentialutilitycurve.Thisrequires onlyoneinputvariable,R,thetoleranceforrisk.Theequationisbelow:

U x=1−e− x / R

Herex is the payoff; Ux is the utility of the payoff x; and R is theindividualsrisktolerance.Apersonwitha large value ofR ismorelikelytotakerisksthansomeonewithasmallerRvalue.AstheRvalueincreases,thebehaviorapproachesthatoftheEMVmaximizer.TheplotforsomeonewitharisktoleranceofR=1000is

below. ActuallyperformingthecalculationstofindutilityateachlevelofpayoffiseasyusingExcel,aswe’llseebelow.ButfirstwehavetodetermineR,theindividual’stoleranceforrisk.

Findingrisktolerance

Therearetwoapproaches

Byaskingthisquestion:consideragamblewhichhadequalchances ofmakingaprofitofXoralossofX/2.Whatisthevalueofx forwhichyouwouldn’tcarewhetheryouhadthegamble.Inotherwords,whatisthevalueofxforwhichthecertaintyequivalentis

Page 225: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

zero?Theexpectedvalueofthealternativeis0.5xX-0.5*(X/2)=0.25X.Aslongas X >0 then the decision-maker is displaying risk- averse behavior, becausehis/hercertaintyequivalentislessthantheexpectedvalueofthegamble.ThegreatertheR,themoretolerantofrisk.Inhisbook,ThinkingFast,ThinkingSlow,DanielKahnemannotes that the ‘2’ in thedenominatorcanvarya tad. It isn’tapreciseformula,butapparentlyitcomesclose:Indiagramform:

Anal- ternativeapproach,moreusedinbusinessandfinance,istoemploy guidelinenumberscalculatedbyProfessorRonHoward¹,apioneerofdecision-analysis.Basedonhisyearsofexperience,HowardsuggeststhattheRvalueforacompanyis:

•6.4%ofnewsales

•124%ofnetincome

•15.7%ofequity

##Exampledecisionusingutilitymaximisation

Whatwe’ll dohere is to examine adecision from both the EMV and ExpectedUtilityangles,using theexponentialutility function. Thecalculationofexpectedvaluesisthesame:thedifferenceisthatinsteadofmultiplyingeachmonetarypayoffbyitsprobabilityandthenaddingup,wemultiplytheutility.

¹https://profiles.stanford.edu/ronald-howard

Page 226: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

The decision set-up. All money figures in millions. I run a coffee importingbusiness.Ineedaspecialimportlicensetobringinaparticulartypeofcoffee.Theequityofmycompanyis12.74.So,myRvalueis15.74%of12.74=2.

Ihaveachoicebetween‘rush’and‘wait’forthepermit.

IfIrush,Ihavetopayaspecialfeeof5,andthereisa50/50chanceofthepermitbeinggranted.Ifitisgranted,Iwillhavesalesof8,givingmeanetprofitof8-5=3.Ifitisn’tgranted,Istillhavetopaythe5,butwillhavesalesofonly6,givingmeanetprofitof2.First,findtheEMVandEUofrushing.TheEMVis0.5x3+0.5(2)=2.5.

Usingtheexponentialcurveformula,theutilityof3is=1-exp(- 3/2)=0.77687.Andtheutilityof2is=1-exp(-2/2)=0.632121.TheEUforrushingis0.5(0.77687)+0.5(0.632121)=0.704496.Allfiguresareinmillions.

Nowforthealternative,whichistowait.Inthisscenario theprobabilitiesarethesameat50/50,butthecostisonly3ifitisissued,nothingifitisn’t.Thesalesarethesameat8and4.Thenetprofitsare8-3=5and4.TheEMVis0.55+0.54=4.5.TheutilitiesareU(5)=1-exp(-5/2)=0.918andU(4)=1-exp(-4/2)=0.865.TheEUis0.50.918+0.50.865=0.892.

ThetablebelowshowstheEMVsandEUs.

Spreadsheetfortherushdecision

Butwecangofurtherusingcertaintyequivalents,subjectofthenext section.

Page 227: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 228: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

14.3Convertinganexpectedutilitynumberintoacertaintyequivalent.

Fortunately there is an easyway to convert expected utilities back to certaintyequivalents. It is then straightforward to rank the decisions by certaintyequivalents. Remember the concept of the certainty equivalent? The amount ofmoneywhichitwould takeforyou to sellyourgamble? It turnsout thatwecanconvertanexpectedutilitynumberintoacertaintyequivalentusinganequation:

CE=− R∗ln (1−EU)wherelnisthenaturallogarithm.Weusethisequationwherewearetalkingprofitsandwantmore.Inthecaseofcosts,takeoffthenegativesigninfrontofR.

AbovewefoundEU=0.704496fortherushdecisionbranch.ThatisaCEof

=-2*LN(1-0.704496)= 2.43814

using Excel’s ln function. For the wait decision, the EU was 0.892, giving acertaintyequivalentof=-2*LN(1-0.892)=4.451248

LargerCEsarepreferredwhentheoutcomesareprofits,andsmallerCEswhentheoutcomesarecosts.SoI’dwait!ThisconfirmsthedecisionmadewiththeExpectedUtilities.

Page 229: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce
Page 230: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

15.GlossaryThisisahands-onguidetodoingsomebasic analyticworkwith data.However,statisticshasitsownsetofvocabularyandtech-niques.Whenyouarepresentingyour results tootherpeople,youmaywish to followestablished techniques andterminology. Also, here you’ll find more of the underlying theory if you areinterestedinhowExcelgetstheresultsitdoes.

Binomialdistribution

The binomial distribution is used to find the probability of the occurrence of acertainnumberofevents(calledsuccesses,althoughtheycouldbeanythingclearlydefined)withinanumberoftrials.Thebinomialdistributionwillgive,forexample,theprobabilityofatleasttenpartsbeingdefectiveoutofthenexthundred,providedthelong-termdefectiverateisknown.Thebinomial requires some conditions toworkproperly.These are:

asequenceofnidenticaltrialstherearetwooutcomesforeachtrial,denotedsuccessandfailuretheprobabilityofsuccessdoesn’tchangethetrialsareindependent.

It isquiteeasytocalculatetheprobabilitieswithExcelbutfor illustrationhereistheequationwhichExceluses:

f( x )=

(n)

x

p x (1 −p )

( n − x )

wherexisthenumberofsuccesses;ptheprobabilityofsuccessononetrial;nthenumberoftrials,andf(x) theprobabilityofx successes inn trials.Thisnotationpointsupthefactthattheresultisafunctionofx.

137

Page 231: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Centrallimittheorem(CLT)

Thecentral limit theoremisfundamental instatistics.What the theoremstates isthis:let’ssayyouhavealargepopulationand you keep taking samples from thepopulationandmeasuring themeanofeachsample.Thenyoudrawahistogramofthemeanssothateachmeanisavariableinthedistribution.Inotherwords,insteadof plotting each observation independently, we plot them by themeans of eachsample.Thehistogramwhichappearswillbeapproximatelynormallydistributed,orintheshapeofthewell- known bell curve. This is wonderful news because thenormaldistributioniswellbehaved,andsowecancalculatetheareaunderanypartofit.

Correlation

Youhaveapairofvariables:forexample:websitehitsand actualsales.Youwantto knowwhether there is a relationship between thevariables.Andwhetheritis‘statisticallysignificant’,orperhapswhatyouthoughtwasarelationshipoccurredjustbychance.

Correlation is the study of the linear relationship between two or more randomvariables.Correlationanalysisprovidesboththedirectionand thestrengthof therelationshipbetweenthevariables.Forexample,thereisarelationshipbetweentheheightandweightofindividuals.There is apparently also a relationshipbetweenconsumptionofsugarandobesityratesatthenationallevel.Cor-relationanalysiscantestwhethertherelationshipsarespuriousorreal.

It’strueandworthrepeating: correlationdoesnotimplycau-sation .Itdoesnotnecessarily follow that onevariable causes the other even though theymight behighly correlated. There are many examples of such spurious correlations, forexamplenational consumptionof chocolate andnumber ofNobel prizeswon. Ifyou do find a correlation, first think about whether you have a lurking variablediscussed below. The two examples given above for height/weight and sugarconsumptionmightverywellinclude

Page 232: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

causaleffects,butthisisnotsomethingthatcorrelationanalysiscanestablishonitsown.Indeed,somephilosophers(such asDavid Hume) claim that causation canneverbedecisivelyestablished.

Nowwewill:

•Visualizeacorrelationwithascatterplot

•explainwhythecoefficientofcorrelationworks

•interpretthevalueofthecoefficientofcorrelation

•testforthestatisticalsignificanceofthecoefficientofcorre-lation

Scatterplotsandcorrelation

Acanningfactoryhasdataonnumberofworkersemployedandoutputofcans.Thedatasetis‘canning’.Thefactoryisinterestedintherelationshipbetweenthenumberofworkersandoutput.LoadthedataintoExcel.

Firstfewlineofthecanningdata

Thecolumnsofdata representsnumbersofworkers andcansmanufactured.Forthenotationwe’llcalltheseXandY.WhenImeanthenameofthevariable,I’llusetheuppercase.Forindividualobservations,suchasx=23,I’lluselowercase.WewanttoseetherelationshipbetweenXandY.TodothisweneedtohavethedatalaidoutincolumnssothateachobservationofXandYareonthesameline.It’salwaysagoodthingtovisualisetherelationshipwith a scatter plot and the scatter plot isbelow.

Page 233: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Thecansscatterplot

I’vedrawnaverticalandhorizontallinewhichgoesthroughthemeansofthetwovariables.Wecandividetheobservationsinto quadrants,withquadrantonebeingtoprightandsoforth.Iftherelationshipispositivethenwewouldexpecttoseemostof the observations in quadrant one and quadrant three. If the relationship wasnegativethenmostoftheobservationswouldbeinquadrantstwoandfour.

TherelationshipbetweenXandYisnotperfectlylinear.Ifitwas all the pairs ofpointswould lineupalonga straight line.Weneedsomewayofmeasuringhowfar off being in a straight line these observations are. There are two measures,covarianceandcoefficientofcorrelation.Thereissomemathsinwhatfollowsbutwe’llwalkthroughitslowly.

Covariance:Trythisthoughtexperiment:ifalltheYpointswereonaverticalline,thenthesubtractionofthemeanfromeachYvaluewouldgiveuszero:

( y i− y )=0

I’veputinalittleifortheyvaluestoshowthatitcouldbeany

Page 234: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

ofthevaluesintheYvariable.SimilarlyfortheXvalues,ifallthepointswereonaverticallinethen

( x i−x )=0

Wecanseethatwedon’thaveverticallinesandthereissomedif- ferencebetweeneachobservedvalueandthemean.Ifwemultiplytogetherthedifferencesandthenfind theaveragebydividingbyn-1wecanfindthecovariance.Wedividebyn-1becauseweareinmostcasesdealingwithasample.Subtracting1adjustsforthisfact.ThesamplecovarianceofXandYistherefore:

S XY=

( x−x )( y−y )

n−1

Thebigproblemwithcovarianceisthattheresultisverydependentontheunits.Wecan get round this problem by standardizing the differences between eachobservationand itsmean.Wedo thatby dividing the difference by the standarddeviation of the variable. You can think of this technique as providing thedifferenceinunitsofstandarddeviations,so

( x−x ¯

s xandthesamefortheYvariable.Nowwenolongerneedtoworryabouttheunits.Thecoefficient of correlation is the covariance now between the standardiseddifferences,not thedifferences them- selves.The equation for the coefficient ofcorrelationofasampleis

r XY=∑

( x−x )( y−y )( n−1) S XS Y

Performingthiscalculationisextremelytediousandweusuallyletthecomputerdo

Page 235: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

thework.Typein=correl(array1,array2)andhit

Page 236: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

enter.Forthecansdatatheresultis0.991083.ThisisthePearsonProductMomentor‘r’value.

CorrelationYoutube¹

Valuesofr:valuesofrarerestrictedtotherange-1to+1.Anegativevaluemeansthat the slope is negative: as one variable increases thentheotherdecreases.Apositivevaluemeansthatbothvariables increasetogether.Thelargertheabsolutevalueofr,thenthecloser thecorrelation.Acorrelationofzeromeansthat all thepointsarescatteredaboutwithnopatternwhatsoever.Thervalueof0.991083forcansmeansthatthepointsarelinedupalongastraightline,whichiswhatwewouldexpectfromlookingatthescatterplot.

Testingr:theobservationsthatwehaveusedtocalculaterconsistofasamplefromapopulation.Becauseitisonlyasample,wedon’tknowwhetherthecorrelationwouldholdup ifweweresomehowabletogainaccesstothepopulation.ris thenotationforthecoefficientofcorrelationforasample.ForapopulationweusetheGreekletterρ“rho”.Generally,Greeklettersareusedfor population parameters.ThisdistinguishesthemfromtheRomanlettersusedforestimates.

Thehypothesiswe’retestingis:

againstthenull

H 0:ρ ≤0

H a :ρ =0

Wefindtheteststatisticusingthisequation

r √ n −2

t =√

−r 2

¹https://www.youtube.com/watch?v=QDj8aYfvccQ

Page 237: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

We’llfirstworkoutthevalueoftheteststatisticusingthervaluewefoundabove.Thenwe’llfindthepvaluefromthis.

Thervaluefoundfromthecorrelationwas0.99and therewere 24 observations(n=24).Plugginginthenumbers,wefindthatt=32.86.Thisisaonetailedtest,souse=t.dist(yourvaluefort,n-1,andfalse) tofindapvalueof2.67343E-21.The -21meansmovethedecimalplace21placestotheleft,sothisnumberis effectivelyzero.Thervaluewefoundisstatisticallysignificant.Itismuchsmaller than0.05,sowecanrejectthenullhypothesis.Thereisacorrelation.

Descriptivestatistics usesgraphsand tables topresent whatweknowabout thedata. This is a powerful method of gaining insights into the data, and also formakingpresentations.However,descriptivestatisticslimitsitselftothedataavailabledoesnotmakeanyinferencesbeyondwhatwehavefromthedatatohand.

Distribution of the observations within a dataset describes the density of theobservations:aretheyspreadaboutevenly,orpeakedinthemiddleandthenspreadoutoneithersideoftheirmean?

One of the most common and useful distributions is the normal distribution,sometimesknownas thebellcurve.Manythings in thenaturalworldfollowthisdistribution.Forexample,ifyouwereabletomeasuretheheightsofalargenumberofmenorwomen,you’dfindthattheyfollowedanormaldistribution.

TheplottotherightshowsthedistributionofrecordsofheightsofparentscollectedbyFrancisGaltoninthe19thcentury.Galtonwastryingtofindouttherelationshipbetweentheheightofparentsandtheirchildren.Alongtheway,hediscoveredthephenomenonknownas‘regressiontothemean’.

Ihaveplottedtheheightsasahistogramandaddedanormalcurvewiththesamemeanandstandarddeviationastheoriginaldata.Thisclearlyfollowsthebellcurve.

Thenormaldistributionisveryimportantinstatisticsforseveralreasons

Page 238: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

-thenormaldistribu-tionis‘well-behaved’inthesensethatital-waysfollowsthesamemathematicalfunction.Theequationisalit-tlecomplex,butwecandrawanynormaldistributionprovidedweknowitsmeanandvariance.Themeanistheaverageoftheob-

Galton’sheightobservations

servations,andinanormaldistributionoccursrightinthemiddle.Thevarianceisameasureoftheaveragedistanceofeachobserva-tionfromthemean.Thelargerthevariance,themore‘spreadout’ordispersedthedata.

Thegraphshereshowtwodatasets,withidenticalmeans butdifferentvariances.In most of this book we won’t use the term variance, but instead standarddeviation .Thestandarddeviationisjustthesquarerootofthevariance.Weusethestandarddeviationprimarilysothatwecanusethesameunitsasthemean.Thefirstplothasastandarddeviationof5,andthesecondoneastandarddeviationof2.Theplotswerecreatedbygeneratingarandomsetof1000numbers,withameanoftenandthestandarddeviationsjustdescribed.

-the second reason is connected with the most important result in statistics,the CentralLimitTheorem ,orCLT.This statesthatifyoukeepdrawingsampleswithreplacementfromalargepopulation, themeans of the samples will followa normaldistribution .Theimportantthingisthatthedistributionoftheoriginalpopulationdoesn’tmatteras longaswedrawat least30samples.Infact, itgetsbetterthanthat:iftheoriginalpopulationisnotnormallydistributed,thenwedon’tneedasmanyas30inoursample.Asfewas10willdothetrick.

Page 239: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Standarddeviation=5

Becausethenormaldistributionfollowsanequationsowell,wecan find the areaunderanypartofitveryeasily.

Standarddeviation=2

Themean splits the area under the curve into halves,meaning that50%of theobservationsaresmallerthanthemean,and50%larger.Acommonpracticeis tomarkoff thehorizontalaxis inunits of standard deviation. The further from themean,thegreaterthestandarddeviation.Thishelpsbothinvisualizingtheshapeofthedata,butalsoinusingtheempiricalrule .

Page 240: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Empirical rule : a ‘rule of thumb’ which states that when data are normallydistributed,

-64%ofthedataarewithinonestandarddeviationofthemean-95%ofthedataarewithin twostandarddeviations of themean -99.7% of the data arewithin threestandarddeviationsofthemean

Thismakesintuitivesense:only100-99.7=0.3%ofthedataaremorethanthreestandard deviations from the mean. They would therefore be in either of theextremeendsortailsofthedistribution.Theprobabilityoftheiroccurrencewouldbeverysmall.Bycontrast,observationsclosertothemean, those located onlyasmallnumberofstandarddeviationsfromthemean,aremuchmorelikelytooccur.Thatiswhythehistogramishighestonatthemean.Thoughtexperiment:veryshortorverytallpeopleareunusual(whichiswhyyoustareatthem).Bycontrast,peoplearoundthemeanheightarenotatallunusualandyoujustpassthemby.

Estimation : The statistical process is circular: we start off by choosing thecharacteristic of the population that interests us. That characteristic is called aparameter.Wedrawarandomsamplefromthatpopulation.Weuse thestatistic toinfer information about the same quantifiable feature in the population. Thestatistic(for ex-ampleanaverageoraproportion)isanestimateofthepopulationparameter. The process of finding how close the estimate to the parameter, andbuildingaconfidence interval around the estimate, is inference .We test claimsabouttheparameterusinghypothesis-testing,whichiscloselyrelatedtoinference.Hypothesis-testingisthefinalpartofthischapter.

Excel –morescreencasts

Insertingthedataanalysistoolpakadd-in²Thenormaldistribution³

²https://www.youtube.com/watch?v=Ods1Z5BLD9g&list=UUVZFhqXDgJgIRq-ljotClgQ

³https://i1.ytimg.com/vi/1qn0EIm5-PQ/mqdefault.jpg

Page 241: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

ExponentialSmoothing

Hypotheses and p values A hypothesis is a claim someone makes about apopulation.Forexample,acoffee importer (me!) is toldby the supplier that themeanweightofthesacksofcoffeebeansis40kg.Thatistheclaim.Iwanttotestthat claim because I am paying for the coffee. All the sacks of coffee in theshipmentarethe‘population’.Itakearandomsampleofsackstocheck.Ifind thatthemeanweightofthesampleis38kgs.DoIrejectthewholeshipment,orsayImusthavegot a low sample, let’s take them? The hypotheses here are: thenullhypothesis:thepopulationmeanweightis40kg.Meanwhilethealternativeothepopulationmeanweightisnot40kgs.Hypothesistestingprovidesananswertothetestintheformofapvalue,definedbelow.

ApvalueistheprobabilityoffindingasamplewiththemeasuredcharacteristicsIFthe null hypothesiswas true. In the case of the coffee above, let’s say that wecalculatedthepvalue(SeeChapter

9)anditwas0.1.Thatmeansthatthereisa10%chanceoffindingasamplewithameanweightof38kgsiftherealtruemeanweightoftheshipmentwas40kgs.Themostfrequentlyusedcut-offpointis0.05or5%.Ifthepvaluewassmallerthan0.05thenwewouldsaythatthereisalessthan5%chancethatthissamplecamefromapopulationwiththeclaimedweight.Therefore the shipment isn’t of the claimedweightandshouldberejected.

Hypothesiswritingandtesting

Ahypothesisisastatementorclaim.Forexample,‘themeanweight of thecoffeepacksis3kg’isahypothesisorclaim.Thehypothesisneedstobetestedsothatweknowwhethertheclaimstandsorisshowntobefalse.

Hypothesesarewrittensothattheclaim,andthecounter-claim,aredistinct.Therearetwohypotheses,onefortheclaimandthe other for the counterclaim.Ho, the‘nullhypothesis’,containstheclaim.Ha,the‘alternativehypothesis’,containsthecounterclaim.Wewritehypothesesbyassumingthatthenullhypothesisistrue.

Page 242: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

Thenullhypothesesforthecoffeeexampleabovewouldbe

Ho :µ =3 kg

Thisisreadas:Thenullhypothesisisthatthepopulationmeanweight(mu)ofthecoffeeisequalto3kg.

Thealternativehypothesis (Ha) is that thepopulationmean weight is not 3 kg,written:

Ha :µ =3 kg

Wetestthehypothesesusinginformation(statistics)fromthesam- ple (themeanweight,thesamplesize,andthestandarddeviation).Notethatthecounterclaimsaysnothingaboutthemeanweightinthepopulationbeingmorethanorlessthan3kg,itsimplysaysthatitnotequalto3kg.Also,important,theequalitysignappearsinthenull.Thisisalwaysthecase,whethertheinequalityis‘strict’or‘weak’.

Wecanalsotestwhetherthepopulationmeanissmallerthanor largerthansomeconstant.Theshipperclaimsthattheweightis‘atleast’3kg.Wewritethisas:

H 0:µ ≥3 kg

Notethattheequalityremainswiththenull.Thealternativeisthen thenegationofthenull,andthismustbethattheweightis‘lessthan3kg’.

H a :µ<3 kg

Convinceyourself that theonlypossiblewaytonegateHoiswith thealternativehypothesis.Iftheclaimwasthatthattheaverageweightislessthanorequalto3kg,wesimplyswitchoverthe

Page 243: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

inequalitysigns.Thereareotherformsofhypothesiswritingwhichwewillworkthroughinthisbook.However,theyallsharetherulethattheequalitysigngoesinthenull;andthealternativenegatesthenull.

Inference is thekeystatisticalprocess. It is theway in which information aboutpopulationscanbegleanedfromsurprisinglysmallsamples.Anexampleispollingbeforeanelection.Providedthesampleisdrawnrandomlyandisrepresentativeofthepop-ulation,asampleofperhapsonly1,000peoplecanprovidequiteaccurateestimates of the proportion of the population predicted to vote for a particularpoliticalparty.Theestimatesareusedtomakeinferencesaboutthepopulation.Wewon’tknowuntilelectiondaywhetherornottheinferenceswerecorrectornot.

Inferentialstatisticsisaverypowerfultechniquewhichallowsustomakethejumpfromasampletoapopulation.Wecanuseasmallsampletomakeinferencesaboutthecharacteristicsofapopulationaboutwhichweperhapsknowverylittle.Thekeyistoobtainasamplewhichrepresentsthepopulationasaccuratelyaspossible.Forthis, both randomization and good surveydesign are essential. Wewill discussthesetwoimportantissueslaterinthechapter.Recently,athirdclassofstatisticalmethodologieshasarisen.Thisismachinelearning ,whichtakesadvantageofnewanalytictechniquesandgreatercomputingpower.Wewon’tbeable togointothesetechniquesinthisbook,unfortunately.

Lurkingvariables : Ithappens that there isclosemonthlycor- relationbetweentheconsumptionoficecreamanddeathsbydrowning.Isitthatpeopleeatanicecream,goswimmingandthendrown?Thecorrelationmissesthelurkingvariablewhichistemperature.Inhotweather,peoplebothgoswimmingmoreoftenandbuyice creams. The ice cream and the drowning variables are both correlated withtemperatureandsoarecorrelatedwitheachother.Anotherexample:itseemsthatthereisacorrelationbetween kidsneeding readingglassesandsleepingwith thelighton.Soyou

Page 244: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

should switch the light off? Not so fast. The lurking variable is the short-sightednessoftheparents,notthekids.Theparentsleavethelightonsotheycanseethekid.Thekidsinheritbadeyesightfromtheirparents.

Parameter : a quantifiable characteristic of interest in the popula- tion. Theaverage weight of all of the fish in the Fraser Riveris a parameter. The samecharacteristicinthesampleisknownasastatistic.Wecaneasilyfindthisstatisticbecausethesampleisonlyasubset,andthereforesmaller,thanthepopulation.Weusuallyneverknowtheactualvalueofaparameter,butthatdoesn’tmatterbecausewecanestimatefromthesample.

Pointestimate :astatistic,suchasthemeanoraverage.Itiscalledapointbecauseitisanexactnumbersuchas4.21.Itisnotarange,itisapoint.Itiscalledanestimatebecauseitisanestimateofthepopulationparameter.Therefore4.21isanestimateofwhatthesamequantifiablefeaturewouldbeinthepopulation.Youcanthinkofapointestimateasbeingthe‘bestguess’forthepopulationparameter.

Ifwe selectedadifferent sample from the population, then almost certainly thepointestimatewouldbedifferent,givinganewbestguess.Ifwecontinuetotakesamples,thenwearegoingtogetasmanybestguessesaswehavesamples.Therangeofbestguessesisthesamplingvariation .

Poissondistribution

ThePoissonprobabilitydistributionis,aswiththeBinomial,adis-creteprobabilitytool.Wecanuseitforcalculatingtheprobabilityofeventswhichoccurovertimeorspace.Forexample,numberofmistakesinagivennumberof linesof code.Theformulais:

f( x )=

µ x e− xx !

wheref(x)istheprobabilityofxoccurrencesinagiveninterval.mu

Page 245: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

istheexpectedvalueormeannumberofoccurrencesinthegiveninterval(orarea).eisaconstant,value2.71828.

Population : the group of individuals about whom we want some information.Each individual in the population is called an element or sometimes a case .Examplesofpopulationsareall the fish in the FraserRiver,Toyotacarsmadein2012,orindeedthepopulationofacountry.Itisusuallynotpossibletodealwithdataatthepopulation levelbecauseitiseitherexpensiveorimpossibletoobtain.Asaresultwedrawasample,orsubset,fromthepopulation.

Randomisationisthetechniqueofselectingelementsfromthepopulationtobuildasamplesothateachelementhasthesame known probability of being selected.This technique greatly reduces bias, described in more detail in the followingchapter.

Regression

Samplingframe :alistoftheitemstobesampled.Imaginethatyouwishtosurvey100studentsfromaUniversity.TheUniversityprovidesyouwithalistofstudentsandtheirIDnumbers.Thestudentsformthepopulation,andthelististhesamplingframe.Youwouldusearandomsamplingmethod,coveredbelow,torandomlyselect100studentsfromtheUniversity.The100studentsformthesample.

Standarderror :Trythis thoughtexperiment.Let’ssay that thereare1000jellybabiesinabowlandforsomereasonyou’dliketoknownthemeanweightofajellybabywithouthavingtoweighall ofthem.Youhaveachoiceoftakingarandomsampleofeithern

= 100 or n = 500. Which sample would produce the most accurate estimate:obviouslythesamplesizeof500becauseitis‘closer’tothepopulation size.

Yetitstillwon’tbecompletelyaccuratebecauseitcouldhappenthatthesampleyouselect is biased in someway: perhaps you selected all theheavyones?We canmeasuretheamountoferrorwiththestandarderror,whichhelpsustodeterminehow‘close’wearetothe

Page 246: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

unknownparameter.BelowIshowhowtocalculatethestandarderror(SE).Hereisthestandarderrorforastatisticsuchasamean:

σSE =√ n

Inwords, the standard error is the population standard deviation dividedby thesquarerootofthesamplesize,denotedbyn.Notethatasthesamplesizeincreases,thenthestandarderrordecreases.

Thisis in linewithyourearlier intuition that the larger the sample size then themoreaccuratetheestimate.Usuallywedon’t know‘sigma’,σ ,sowereplaceitwith‘s’whichisthestandarddeviationofthesample.

Transformations :thisisatechniquesometimesusedinregression. Theobjectisusually to transform thedistributionof the dependent variables so that it moreclosely resembles a normal distribution. We do this because the theoreticalunderpinning of linear regression is basedonanassumption of normality in thedependentvariable.Thedependentvariable is typically transformedby taking itsnaturallogarithmandthenusingthetransformedvariableintheregression.Thisisoften used when the dependent variable has only positive values and is highlyskewed to the right. Independent variables can also be transformed such as byaddingasquaredversionof the variable in the regression. Independent variablescanalso betransformedby

zscores :A z score is ameasure of how far an observation contained within avariable is from themean of that variable, in termsof standard deviations. It isquiteeasytocalculatewithinExcel,andthisYoutubeshowsyouhow

zscoresYoutube⁴.

Whydothis?Bydividingthedifferencebetweenthevalueofthe

⁴https://www.youtube.com/watch?v=FSZpynSBev8

Page 247: Straightforward statistics using Excel and Tableau: A ...index-of.co.uk/OFIMATICA/Straightforward statistics... · 1.1 This book is a little different Most books on stats introduce

observationandthemeanofthevariablebythestandarddeviationofthevariable,likethis:

z i=

x i−x ¯

s

thedifferencesareinunitsofstandarddeviations.Wecanusethisfor:

•Comparingthedistributionsoftwocompletelydifferentvariables

• Detectingoutliers.Anoutlierisanobservationwhichisveryfarfromtherestofthedistribution.Usually,99.7%of the observationswillbewithinthreestandarddeviationsofthemean.Soanobservationwhichismorethanthreestandarddeviationsfromthemeanislikelytobequestionable. Thiscanbegoodorbad:

•Badbecauseperhapssomeonemadeadataentryerrorwhichwecancatch.

• Good because perhaps there is something anomalous and interestingaboutthat‘weird’observation.