Application of Data Mining for identifying and predicting room bookings

Catarina Caldeira Maçaroco

Window of Opportunity

Project Work report presented as partial requirement for obtaining the Master's degree in Information Management and Business Intelligence.

NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação
Universidade Nova de Lisboa

APPLICATION OF DATA MINING FOR IDENTIFYING AND PREDICTING WHICH EVENTS LEAD TO ROOM BOOKINGS – WINDOW OF OPPORTUNITY

by

Catarina Maçaroco

Project Work presented as partial requirement for obtaining the Master's degree in Information Management, with a specialization in Knowledge Management and Business Intelligence

Advisor: Roberto Henriques, PhD

February 2019

ABSTRACT

This study investigates search patterns for predicting hotel bookings. Using Expedia's search and purchase data, I identified the user's booking window and which events have a higher effect on the booking likelihood.

The tourism industry has seen exponential growth over the last two decades, largely due to global socio-economic changes, globalization and internet massification. Portugal is finally reaping its share of the profit and continues winning "best destination" prizes year after year. It is now part of a very competitive ecosystem where distribution plays a determinant role in whether the touristic product survives or not. Big players like Expedia and Booking.com have taken control of a big chunk of the market's revenue because they understood that the large amounts of data they hold enable them to predict demand and hence offer highly competitive deals.

Only by understanding the booking drivers can one negotiate better distribution deals and lower commissions, by making better sales predictions and enhancing marketing and revenue strategies. Hence the purpose of this study is to use data mining to find patterns in room bookings, enabling this industry to become an even more important contributor to the country's GDP.

Through the analysis of consumer behavior and booking times, and the use of time series analysis and machine learning, it is possible to find patterns that can be applied not only across Hospitality but in other industries as well.

By looking for the connections and the relevance of each feature to the final predictions, a new window of opportunity will open for marketing and sales professionals.

KEYWORDS

Sales trends; Micro moments; Macro moments; Zero Moment of Truth; Booking Window

INDEX

1. Introduction
1.1. Background and problem identification
1.2. Study Relevance
1.3. Study Objectives
2. Literature review
3. Methodology
3.1. Problem Definition
3.2. Data Collection
3.3. Data Sampling and Cleaning
3.4. Software
3.5. Selecting and fitting models
3.6. Validating results
3.7. Feature engineering
3.8. Models
4. Results and discussion
4.1. Descriptive Analysis
4.2. Predictive Analysis
4.3. Prescriptive Analysis
5. Conclusions
6. Limitations and recommendations for future works
7. Bibliography
8. Appendix
8.1. Appendix 1 – Descriptive Analysis
8.2. Appendix 2 – Predictive Analysis
8.3. Appendix 3 - Automation

LIST OF FIGURES

Figure 1 - Workflow diagram
Figure 2 - Dataiku workflow
Figure 3 - Search window per week per continent
Figure 4 - Search window when booking with package or without package
Figure 5 - Search window if booking includes weekend or does not include weekend
Figure 6 - Search window when booking with children
Figure 7 - Number of bookings including children
Figure 8 - Number of bookings per continent
Figure 9 - Variables importance
Figure 10 - Booking likelihood
Figure 11 - Automation workflow

LIST OF TABLES

Table 1 - Hyperparameters
Table 2 - Cohorts' results
Table 3 - Algorithm details
Table 4 - Quality metrics
Table 5 - Variables Explanation
Table 6 - Models' results

LIST OF ABBREVIATIONS AND ACRONYMS

API       Application Programming Interface
ARIMA     Autoregressive Integrated Moving Average
AHRESP    Associação da Hotelaria, Restauração e Similares de Portugal
CPC       Cost per click
CMS       Content Management System
GDP       Gross Domestic Product
LSTM      Long short-term memory
MCC       Matthews Correlation Coefficient
OTA       Online travel agent
ROC AUC   Receiver Operating Characteristic Area Under the Curve
ROI       Return on investment
SETAR     Self-Exciting Threshold Autoregressive
XGBOOST   Extreme Gradient Boosting
ZMOT      Zero moment of truth

1. INTRODUCTION

Competition for room bookings has never been this fierce. From hotels to destination management companies to OTAs, every company is eager to take its cut from the selling price (Kevin May, 2017).

This study aims to use popular machine learning models to predict the likelihood to book and to act according to that probability.

Through fixed commissions, cost per acquisition or fixed fees, lodging and accommodation businesses are completely dependent on third-party booking agents. Refusing to be part of the system would lead to complete isolation in a market that has been growing at a rate of 2.5% per year (INE, 2017) in terms of number of hotels, where investors are confident in continuously stronger revenues, and where tourism now represents 10% of the Portuguese GDP (Ferreira, 2017), currently its largest source.

Analysing customer behaviour and predicting the booking window is essential to gain a competitive advantage in a market where advertising spend is reaching monstrous levels, with companies like Booking.com spending $3.5 billion on pay-per-click last year (Kevin May, 2017).

XGBoost proved to be the model that performed best and provided the most accurate predictions. The results suggest that adding further relevant variables will increase the model's predictive power. I was able to identify the users that are most likely and least likely to book, creating different segments which can be very powerful for generating effective strategies, impacting the right users with the right nudges that lead to more bookings, hence more revenue.

With the identification of the exact periods and variables that lead to a booking, advertising spend can be better allocated and revenue strategies can become more effective (Trefis Team, 2016).

1.1. BACKGROUND AND PROBLEM IDENTIFICATION

With the increasing importance of tourism both globally and especially for Portugal and its GDP, it is essential to take full advantage of the emerging opportunities.

The tourism and hospitality markets are widely diverse and complex, mainly because they deal with the human need for emotions and experiences (Wu, Mattila, & Hanks, 2015). When looking for accommodation for either business or leisure, the individual is moved by a sense of urgency, a need for relaxation, a sudden rush for something new, just to name a few drivers. These drivers are what marketers are striving to grasp, because doing so means that one reaches the revenue source first, hence avoiding or gaining commissions.

Events that lead to bookings range from mundane everyday events, such as feeling tired after a long day and in need of an escape, or a Sunday lunch with family leading to booking a trip, to extreme events like the Eurocup final, Christmas or New Year's Eve. Still, forecasting for extreme events is extremely difficult due to their low frequency, something that Uber is invested in solving (Laptev, Smyl, & Shanmugam, 2017).

Companies are using online marketing to reach the audience and captivate the users' interest before others (Torres, Singh, & Robertson-Ring, 2015), yet the market is fierce and competitive. Knowing when customers book is a chance to use guerrilla marketing effectively, but it is extremely difficult and time consuming to place the right advert in front of the right consumer at the right time (Hudson & Thal, 2013).

Considering that Booking.com is spending $3.5 billion on Cost Per Click (CPC), and is currently one of Google AdWords' biggest clients (Kevin May, 2017), it has most certainly been conducting these studies for quite a while. Nevertheless, it is not in its interest to share this knowledge, which is why this problem is identified as one that can be tackled independently in this study.

Multiple approaches have been explored by revenue managers, through the exploration of historical data and relating this with past and future events taking place in the locations of buyer and seller (Constantino, Fernandes, & Teixeira, 2016), yet it is not an exact science but rather a hunch.

This study will try to go a step further and pull the data that delivers true insights and value, in this case, customer behaviour. Customer behaviour analytics is what pushed traditional marketing into the background and brought digital marketing into play. Contrary to traditional marketing, digital marketing's core value is that everything can be measured, thus enabling a much more comprehensive view of the client and efficient strategies to generate value (Julie Cave, 2016).

Google and Facebook, for example, store gigantic amounts of data about every single individual (Bangwayo-Skeete & Skeete, 2015). This is their leverage for delivering the right advertising to the right customer at the right time, leading to conversions and to advertisers' reliance on their effectiveness due to the high ROI results.

It is now up to each company to untangle the information it possesses (sales data), correlate it with the city's open data and find the patterns that can lead to competitive advantage (Patrick Whyte, 2017).

The following variables will be considered in this study:

- Sales data

- Consumer behaviour

- Economic situation (individual, local and global)

- Socio-economic events

Applying time series and machine learning to sales forecasting enables finding the patterns that can be translated into booking windows (Bangwayo-Skeete & Skeete, 2015).

The relevance of this study and its main objectives will be further described in the following sections.

1.2. STUDY RELEVANCE

The tourism industry will employ 10% of the population worldwide by the year 2027. In Portugal, this industry currently represents almost 1 million jobs, a number expected to grow to 1,034,000 jobs, representing 22.6% of the total employed population, by 2027 (Joana Nunes Mateus, 2017).

With tourism representing 17% of the Portuguese GDP and being an industry on which almost a quarter of the Portuguese population relies, one can easily see that the commonly charged 10-25% commission on every sales transaction through intermediaries such as Booking.com and Expedia for distribution is revenue in a sense "lost", which could be partially recovered if proper data exploration were done.

It is therefore of great importance that independent studies start being developed on this subject. Only through complete awareness of the sales drivers can one stay afloat, minimise advertising and commission expenditure, and take back control of one's own strategy definition, without being at the mercy of third-party vendors.

1.3. STUDY OBJECTIVES

The main question of this study is posed as: Is it possible to predict booking windows using search and booking data?

The analysis is made on historical sales data and search trends. It aims to address the following specific objectives:

- Which trends can be found in the data provided by Expedia through Kaggle?

Data from the Kaggle competition concerning the time of booking and the dates booked will be analysed to find patterns that will then be correlated with other variables.

- When is the best time to release marketing campaigns?

By crossing the booking times with specific variables, times of day and days of the week, it is possible to detect the specific booking drivers that enable a good prediction of the right time to impact the user.

- Which data improves the model's predictive performance?

Various data features will be analysed. The model's accuracy will vary depending on which features are used, and this will ultimately affect whether the study's objective is reached. This can only be tested once various models are created, run and analysed.

- Which correlations can be found to augment the data's value?

After collecting all the data necessary for this study, the correlation between the data features will be examined to see whether the event did influence the room bookings.

- Which model better fits the data?

Several models will be applied to the data to detect the model with the best predictive outcome.

2. LITERATURE REVIEW

There are several published research papers on the topic of forecasting tourism demand and trends; their techniques and approaches were studied and tested to enable this study.

This study's objective is to evaluate the forecasting performance of artificial neural networks relative to different time series models using Expedia's search and booking data from Kaggle's 2013 competition (Kaggle, 2015).

The tourism industry contributed US$7.6 trillion to the global economy, 10.2% of global GDP (Misrahi & Crotti, 2018). In Portugal the contribution of tourism to the country's GDP is 7.1%, so there is a clear opportunity for growth when compared with the global landscape (INE, 2017). Tourism is one of today's fastest growing economic activities, hence tourism demand forecasting is becoming essential to monitor and predict tourism influx, revenue forecasting and budget allocation. The researchers Song and Lee found it crucial to improve the accuracy and performance of analysis methods by experimenting with new approaches (Chan, Witt, Lee, & Song, 2009).

It is not possible to stock unfilled airline seats, unoccupied hotel rooms or unused concert hall seats. Due to the perishable nature of the tourism industry's products, the need for accurate forecasts is crucial (Law & Au, 1999).

A relevant article for this study is "Forecasting tourism demand to Catalonia: Neural networks vs. time series models" (Claveria & Torra, 2014), in which time series and artificial neural network (NN) models are used to extract patterns and predictive results from Catalonia's tourism demand. There has been an increased interest in more advanced predictive techniques for tourism demand, which is tied to tourism becoming an increasingly stronger global industry. The use of Artificial Intelligence (AI) techniques for data analysis has been growing due to the need for more reliable and accurate forecasts of tourism demand that can deal with increasing complexity. This is mainly because AI models are better capable of dealing with nonlinear behavior, characteristic of travel data, in this case, bookings data. Still, when comparing the forecasting accuracy of the different models, Autoregressive Integrated Moving Average (ARIMA) outperformed Self-Exciting Threshold Autoregressive (SETAR) and NN models, especially for shorter time spans. The pre-processing of the original dataset may be the reason for these results: information was lost when accounting for the presence of seasonality and eliminating outliers, which leads to a lower accuracy of neural network forecasts. Neural networks can be improved through structure optimization, adding layers and memory values, hence future research needs to consider whether the implementation of optimised neural networks and advances in dynamic networks do improve tourism demand forecasting.

In the past few years, due to new and more advanced forecasting techniques, such as neural networks and Gradient Boosting, and the need for more accurate metrics of tourism demand, the interest in Artificial Intelligence (AI) and experimentation with those same techniques grew, mainly because of their capability of handling nonlinear behaviour (Pai, Hung, & Lin, 2014).

The increasing availability of technology at lower costs also enables ever-wider adoption and experimentation. Technology companies, e.g. Google, Facebook, Amazon and Uber, are now providing open source software that enables developers and data scientists to further explore the capabilities of Artificial Intelligence (AI) and Machine Learning (ML), which leads to higher interest in creating better predictive solutions (Mukherjee & Lakshmanan, 2017). With cloud computing, it is possible to achieve efficient energy usage, ease of scaling and cost savings (Zhang, 2016). Upfront cost commitment is much lower now when compared to on-premises solutions.

When considering open data, one study analyzed Google Trends data and examined the usefulness of "hotels", "flights" and "destination country" search indicators, and measured to what extent the search query data improved the AR and SARIMA methods in predicting overnight tourist arrivals. The twelve-month forecast results reveal that AR-MIDAS models gave superior predictions to AR and SARIMA time series models in terms of the Root Mean Squared Error (RMSE) and Mean Absolute Percent Error (MAPE) forecasting criteria.
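The two forecasting criteria named above can be written down in a few lines. The sketch below is only an illustration of how RMSE and MAPE are computed; the arrival figures and the two forecasts are hypothetical and are not taken from the cited study.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(errors) / len(errors))

def mape(actual, predicted):
    """Mean Absolute Percent Error, in percent; actual values must be non-zero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical monthly tourist arrivals and two competing forecasts.
arrivals = [100.0, 120.0, 150.0, 130.0]
forecast_a = [110.0, 115.0, 140.0, 135.0]
forecast_b = [90.0, 130.0, 170.0, 120.0]

for name, forecast in (("A", forecast_a), ("B", forecast_b)):
    print(name, round(rmse(arrivals, forecast), 2), round(mape(arrivals, forecast), 2))
```

A lower value on both criteria indicates the better forecast; in this toy case forecast A is closer to the actual arrivals than forecast B on both measures.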

Google query search data can be used to accurately project future tourist arrivals over a year's horizon. This study contributed to the growing interest in forecasting using web traffic data, making it relevant to other industries that can benefit from the analysis of web search volume histories to predict useful trends (Bangwayo-Skeete & Skeete, 2015).

In 2014 a study looked into Google's search data and examined the usefulness of search indicators such as "hotels" and "flights" and tested their impact on the simple Autoregressive method (AR) and the Seasonal Autoregressive Integrated Moving Average (SARIMA) (Bangwayo-Skeete & Skeete, 2015).

The article "Tourism forecast combination using the CUSUM technique" (Chan et al., 2009) demonstrated that no single method outperformed the others in forecasting accuracy; it was the combination of methods that produced better results.

This led to the increased interest in NNs, which performed better than time series methods, especially due to their capacity to deal with nonlinear relationships between predictors and predicted variables.

A 2014 study tested the accuracy of neural network ensemble prediction when compared with traditional machine learning methods and traditional mathematical statistics methods for studying China's inbound tourism market, in which the neural network ensemble had clearly better predictive capabilities (Bo & Shi-Ting, 2014).

The NN's capacity to emulate the human brain to identify patterns in historical data and learn from experience to capture functional relationships among data when the underlying process is unknown (Claveria, Monte, & Torra, 2015) also leads to its shortcoming of poor comprehensibility due to its "black box" nature: it is helpful for predicting results but does not allow interpretation of the outcomes and coefficients, and therefore offers little understanding of the nature of the problem under study.

In the 2015 study "Common trends in international tourism demand: Are they useful to improve tourism predictions?" (Claveria et al., 2015), researchers modelled tourism demand incorporating the common trends in international tourist arrivals from all visitor markets to a specific destination and analyzed whether the approach allowed improving the forecasting performance of NN models. They used three NNs: the multi-layer perceptron network (MLP), the radial basis function network (RBF) and the Elman network.

In the study "Univariate versus multivariate time series forecasting: an application to international tourism demand" (du Preez & Witt, 2003), it was found that multivariate time series models did not surpass univariate time series models in forecasting accuracy.

Experimental results demonstrated that the forecasting efficiency of a neural network is superior to that of multiple regression, naive, moving average and exponential smoothing. This indicates the feasibility of applying a neural network model to practical international tourism demand forecasting (Law & Au, 1999).

Uber, a global car-hailing and mobility technology company based in San Francisco, United States, conducted a study to create a model for forecasting extreme events. They trained a multi-module neural network based on Long Short-Term Memory (LSTM) on thousands of time series. They chose LSTM due to its capacity for modelling complex nonlinear feature interactions by working with large amounts of data across numerous dimensions, its use of external variables and its automatic feature extraction (Laptev et al., 2017).

AI can identify patterns or irregularities that would otherwise stay hidden. ML has the capacity to spot opportunities that can make the difference. Its value is greatly increased when combined with human analysis, that is, when insight and vision enable data to take form and translate into actionable strategies.

3. METHODOLOGY

For this study I developed the following methodology, illustrated in Figure 1. It delivers a forecasting method that will enable accurate predictions based on past and future events affecting room bookings. The methodology is composed of the following eight stages:

- Objectives

- Data sources

- Descriptive Analysis

- Feature Engineering

- Modelling

- Models Fine Tuning (Results and Revalidation)

- Results Analysis

- Automation advice

Figure 1 - Workflow diagram

3.1. PROBLEM DEFINITION

To conduct effective forecasting, one needs to ensure the problem definition has been fully explored and defined. The problem definition described in section 1.1 serves as a compass for the following steps and for the successful evaluation of the methodology and the obtained results.

3.2. DATA SOURCES

This section provides an overview of the data, the pre-processing and cleaning steps taken, as well as feature selection and engineering.

Data was retrieved from Kaggle's Expedia competition (Kaggle, 2015). The data covers users' search and booking data from 2013 and 2014, 620,440 records. This includes both click and booking events.

The competition goal was to predict booking outcomes for a user event, based on their search patterns.

This data was selected due to the relevant features that enable the use of different models and proper prediction outcomes. The dataset contains features that provide general information about the user, such as ID, the user's continent when booking and the destination of the booking. It also shows the users' behaviour, with check-in and check-out dates and time of booking, as well as whether or not a promotion was being displayed at the time of booking. The variables' explanation can be found in Appendix 2, Table 5.

3.3. DATA SAMPLING AND CLEANING

From the large set of data collected from Kaggle's competition, only 8.4% were actual bookings. To overcome this issue I rebalanced the sample to increase the share of instances where there was a booking.

I removed the rows with missing values in origin-destination distance, which amounted to 36.8% of the data. After removing them, I was left with 409,602 rows of data, enough for this project. Missing values can complicate the analysis of the study. I chose to drop the data rows rather than impute values because they accounted for 36.8% of the data, which is a high percentage of values to impute and could lead to wrong predictions in the future (Kang, 2013).
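The two cleaning steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Dataiku recipe used in the study: the column names `orig_destination_distance` and `is_booking` are assumed, the records are synthetic, and random undersampling of non-bookings is only one possible way to rebalance (the exact rebalancing method is not specified in the text).

```python
import random

def drop_missing(rows, column):
    """Keep only rows where `column` is present (not None)."""
    return [r for r in rows if r.get(column) is not None]

def undersample_majority(rows, target, ratio=1.0, seed=42):
    """Randomly keep at most `ratio` non-bookings per booking (assumed method)."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[target] == 1]
    negatives = [r for r in rows if r[target] == 0]
    keep = min(len(negatives), int(len(positives) * ratio))
    return positives + rng.sample(negatives, keep)

# Synthetic toy records: one booking in every ten events, a quarter of distances missing.
rows = [
    {"orig_destination_distance": None if i % 4 == 0 else 100.0 + i,
     "is_booking": 1 if i % 10 == 1 else 0}
    for i in range(200)
]

clean = drop_missing(rows, "orig_destination_distance")
balanced = undersample_majority(clean, "is_booking")
booking_share = sum(r["is_booking"] for r in balanced) / len(balanced)
print(len(rows), len(clean), len(balanced), booking_share)
```

Dropping rows before rebalancing keeps the booking share computed on the data that is actually modelled; the `ratio` parameter is a hypothetical knob for choosing a less aggressive balance than 1:1.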

3.4. SOFTWARE

Dataiku version 4.1 was the software used in this study. It enables the use of Machine Learning in a clear manner. I was able to build and optimize the models in Python, which allows for seamless future integrations with external libraries through an API.

Dataiku enables the creation, training and deployment of advanced custom Machine Learning models through the use of Python or R. For this study I chose Python.

3.5. SELECTING AND FITTING MODELS

Different approaches were tested to be able to deal effectively with the different data types and dimensions. Flexibility and scalability are essential for this study and for model development, which is possible with the use of Dataiku.

In this study I trained Logistic Regression, Random Forest, XGBoost and Neural Network models on the data.

I chose these models due to their characteristics, as follows. Logistic Regression's outcomes enable interpretation, which was essential to determine which variables are most relevant for the purpose of this study. Random Forest is an easy-to-use model which is capable of dealing with multiple variables in an effective way. It is widely used due to its simplicity and the fact that it can be used for classification and regression tasks. Neural Networks are complex, uninterpretable models that have gained traction in recent years due to their capacity to deal with multiple variables and provide very accurate predictions with very large datasets. XGBoost is designed for speed and performance and outperforms various models in its predictive capabilities. It has been used in several Kaggle competitions, producing great results and winning several of them. XGBoost was engineered for efficient computing time and optimal use of memory resources. In recent years, data scientists have used these models to successfully predict outcomes similar to those of this study (Sunil Ray, 2017). A more in-depth explanation of each model can be found in section 3.8.

The dataset was divided into 3 feature cohorts: cohort 1, cohort 2 and a cohort with all the features. Each cohort has a different number of features; refer to Table 2 in Results and Discussion. I wanted to find out whether adding more features in training would impact the models' results.

3.6. VALIDATING RESULTS

Once the models were tested and results retrieved, these needed to be revalidated to ensure that the developed models are accurate and can be applied to new sets of data while continuing to produce quality results, that is, to see whether the results can be generalized. I used a separate subset of the dataset to check whether the model overfitted the data I trained it with. Refer to Table 1 for the models' hyperparameters.

Table 1 - Hyperparameters

For this project the assessment metric was ROC AUC, since I wanted to have predictions optimised for true positives versus false positives. It shows how well the model performs at categorising outcomes.
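To make the metric concrete, ROC AUC can be read as the probability that a randomly chosen booking receives a higher predicted score than a randomly chosen non-booking. The sketch below computes it directly from that definition; the labels and scores are hypothetical, and in practice a library routine would normally be used instead of this quadratic-time loop.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes (1 = booking) and predicted booking probabilities.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of bookings from non-bookings, which is why the metric is insensitive to the low share of bookings in the data.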

3.7. FEATURE ENGINEERING

Some features, when combined with others, can provide insights into the data they aim to represent.

Time of search and time of booking were parsed in order to provide a clearer understanding of search and booking times.

The date_time data was parsed and the date components were extracted into Year, Month, Day, Day of Week and Hour, in order to see if there is a trend by month or day of the week.

The same process was applied to srch_ci and srch_co, the check-in and check-out dates. I computed the time difference between date_time and check-in, and the difference between the check-in and check-out dates. I binned the search months into quarters and did the same for the check-in and check-out months.

I also binned the search hour into 4 bins of 6 hours each: midnight to 6am, 6am to 12pm, 12pm to 6pm and 6pm to midnight.
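The date-based features described above can be sketched with the standard library. The field names date_time, srch_ci and srch_co come from the text; the timestamp formats, the output feature names and the bin labels are assumptions made for this illustration.

```python
from datetime import datetime

def hour_bin(hour):
    """Map an hour (0-23) to one of four 6-hour bins."""
    bins = ["00-06", "06-12", "12-18", "18-24"]
    return bins[hour // 6]

def quarter(month):
    """Map a month (1-12) to its calendar quarter (1-4)."""
    return (month - 1) // 3 + 1

def engineer(date_time, srch_ci, srch_co):
    """Derive calendar and window features from the three raw timestamps."""
    searched = datetime.strptime(date_time, "%Y-%m-%d %H:%M:%S")  # assumed format
    check_in = datetime.strptime(srch_ci, "%Y-%m-%d")             # assumed format
    check_out = datetime.strptime(srch_co, "%Y-%m-%d")            # assumed format
    return {
        "year": searched.year,
        "month": searched.month,
        "day": searched.day,
        "day_of_week": searched.weekday(),  # Monday = 0
        "hour": searched.hour,
        "hour_bin": hour_bin(searched.hour),
        "search_quarter": quarter(searched.month),
        "check_in_quarter": quarter(check_in.month),
        "check_out_quarter": quarter(check_out.month),
        # Days between the search date and check-in (the booking window).
        "booking_window_days": (check_in.date() - searched.date()).days,
        # Nights between check-in and check-out.
        "stay_nights": (check_out - check_in).days,
    }

features = engineer("2014-01-15 19:30:00", "2014-03-01", "2014-03-04")
print(features)
```

For this hypothetical search, made on a Wednesday evening in January for a stay starting 1 March, the booking window is 45 days and the stay lasts 3 nights.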

The target search month was extracted from the search date and the booking window. This feature can produce insights on how certain properties might be more desirable than others in certain months. This would make it possible to confirm whether a property in, for example, Continent 3 is searched for more in January for stays in September, which would in turn enable better predictions.

3.8. MODELS

The models used in Dataiku were Random Forest, Logistic Regression, XGBoost and Artificial Neural Network.

Starting with Logistic Regression, it is a classification algorithm that uses a linear model which computes the target feature as a linear combination of the input features. It is prone to overfitting and sensitive to errors in the input dataset. Still, the use of a simple linear algorithm can be helpful in exploring the data and reaching insights regarding the data's underlying structure.
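The linear combination mentioned above is turned into a booking probability by the logistic (sigmoid) function. The sketch below shows only the prediction step; the two input features and all coefficient values are hypothetical and were not estimated from the Expedia data.

```python
import math

def predict_proba(features, weights, bias):
    """Linear combination of the inputs passed through the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two illustrative inputs:
# [is_package, booking_window_days / 100].
weights = [0.8, -0.5]
bias = -2.0

p_package = predict_proba([1, 0.45], weights, bias)
p_no_package = predict_proba([0, 0.45], weights, bias)
print(round(p_package, 3), round(p_no_package, 3))
```

Because each coefficient acts on the log-odds, its sign and size can be read directly: with these made-up values a package raises the predicted probability, which is the kind of interpretation that makes the model useful for identifying relevant variables.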

In addition to this linear model I selected 3 non-linear models to better capture the complexity of the data, which were:

Random Forest, which is composed of many decision trees, each of which predicts an outcome that contributes to the final answer of the forest. It is an ensemble learning method for classification, regression and other tasks. By using Random Forest I avoided the overfitting of individual decision trees, which is possible because a random element ensures that the trees in the forest are not all identical. It lacks explainability but generally provides good results. It can deal with multivariate data and, by averaging across the trees' booking probabilities, should help prevent over-fitting by individual trees (Niklas Donges, 2018).

XGBoost (eXtreme Gradient Boosting), an advanced gradient tree boosting algorithm which uses parallel processing, regularization (which helps prevent overfitting) and early stopping, making it a fast, scalable and accurate algorithm. Being an ensemble learning method, it combines the predictive power of multiple learners. The result is a single model which gives the aggregated output from several models. The models forming the ensemble can be either from the same or from different learning algorithms. Still, they have mostly been used with decision trees. Each additive learner in boosting is fitted to the residual errors of the previous step, so the boosting learners make use of the patterns in the residual errors. At the stage where maximum accuracy is reached by boosting, the residuals are randomly distributed without any pattern (Ramya Bhaskar Sundaram, 2018).
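The idea of fitting each new learner to the residual errors of the previous step can be shown with a deliberately small example. The sketch below is plain gradient boosting with one-split decision stumps and squared error on a single feature; it is not XGBoost itself, which adds regularization, second-order gradient information and parallel tree construction. The data, the number of rounds and the learning rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Additive model: each stump is fitted to the current residual errors."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    history = []  # mean squared residual before each round
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        history.append(sum(r * r for r in residuals) / len(residuals))
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    model = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    return model, history

# Hypothetical one-feature data with a nonlinear shape.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.1, 0.1, 0.2, 0.6, 0.7, 0.9, 0.9, 0.4]
model, history = boost(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

The recorded history shows the mean squared residual shrinking from round to round, which is the behaviour the paragraph describes: once the remaining residuals carry no pattern a stump can exploit, further rounds stop helping.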

It has been used in real-world production pipelines for ad click-through rate prediction and has provided state-of-the-art results on various other problems such as store sales prediction, customer behaviour prediction and web text classification (Chen & Guestrin, n.d.).

Artificial Neural Networks are inspired by the functioning of neurons and consist of several hidden layers of neurons, which receive inputs and transmit them to the following layer. They can deal with non-linearity, allowing for complex decision functions. They lack interpretability of the features' importance for the model's predictive outcome.

Estimating feature importance, and model interpretability in general, is an area where Haldar et al. took a step back with the move to NNs. Estimating feature importance is crucial in prioritizing engineering effort and guiding model iterations. The strength of NNs is in figuring out nonlinear interactions between the features. This is also their weakness when it comes to understanding what role a particular feature is playing, as nonlinear interactions make it very difficult to study any feature in isolation (Haldar et al., n.d.).

The target variable was whether or not users made a booking.

All features were input into the model except date_time, user_id, srch_ci and srch_co, since these correlate directly with the engineered features, which ultimately provide higher predictive value and avoid overfitting.

I optimized the models for ROC AUC, which shows the performance of the models in terms of the true positive rate against the false positive rate.

The Dataiku flow shown in Figure 2 illustrates the steps taken in the data analysis and modelling stages of the study. The data was imported, cleaned and then divided into test and training sets. Models were trained on the training set with the various variable cohorts and different algorithms.

Finally, the results were divided in two to ensure that the models did not overfit the data.
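The division into training and test sets mentioned above can be sketched as a shuffled hold-out split. This is an illustration only: the 80/20 proportion and the fixed seed are assumptions, since the split ratio used in the Dataiku flow is not stated.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out the last `test_fraction` as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical record identifiers standing in for the cleaned dataset.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Comparing a model's ROC AUC on the training rows with its ROC AUC on the held-out rows is a simple check for overfitting: a large gap suggests the model has memorised the training data.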

Figure 2 - Dataiku workflow

4. RESULTSANDDISCUSSION

4.1. DESCRIPTIVEANALYSIS

Comparing booking windows across continents reveals clear differences. For continents 4 and 1, the booking window varies significantly throughout the year, whereas it varies much less for continents 0, 2 and 3.

For continent 4, week 3 has the highest average booking window, at 108 days, and week 26 has the lowest, with bookings being made 82 days in advance. Refer to appendix 1, figure 3.

These results can provide highly valuable, actionable insights to any company advertising Hospitality or Hospitality-related products and services.

The average search window varies widely depending on whether or not the booking includes a package. With a package, the average booking window is 75.8 days, against 44.6 days without a package. Refer to appendix 1, figure 4.
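
Comparisons of this kind are group averages of the booking window. The sketch below shows the calculation on four made-up rows; the key names `is_package` and `booking_window_days` are hypothetical, and the toy values are not taken from the dataset.

```python
from collections import defaultdict


def mean_by_group(rows, group_key, value_key):
    """Return the average of value_key for each distinct value of group_key."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        acc = totals[row[group_key]]
        acc[0] += row[value_key]
        acc[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Made-up rows, for illustration only.
toy_rows = [
    {"is_package": 1, "booking_window_days": 80},
    {"is_package": 1, "booking_window_days": 70},
    {"is_package": 0, "booking_window_days": 40},
    {"is_package": 0, "booking_window_days": 50},
]
averages = mean_by_group(toy_rows, "is_package", "booking_window_days")
```

The same helper applies to the other breakdowns in this section by changing the grouping key, for example to the number of children or the destination continent.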

There is not much difference in the search window whether or not the booking includes a weekend, ranging from 50 to 54 days in both scenarios. Refer to appendix 1, figure 5.

When booking with children, the booking window increases with the number of children, and there is a clear pattern within the data. When booking without children the average booking window is 50.66 days; when the booking includes one child the average goes up to 54.87 days, and it reaches 65.43 days when the booking includes 3 children.

Still, these differences may not be statistically significant due to the scarce number of bookings with more children. 78.5% of bookings are made with 0 children, 11% with one child and 8.6% with 2, and there is a steep decline in bookings with 3 children, which represent only 1.4% of the total bookings in the dataset. Refer to appendix 1, figures 6 and 7.

Most bookings are made for continent 2, with 63.9% of bookings. I therefore assume that continent 2 is likely to be North America, due to the population volume and high numbers of internal travel. Refer to appendix 1, figure 8.

These numbers also reflect the higher market penetration in continent 2 compared with the remaining continents.

4.2. PREDICTIVE ANALYSIS

I ran three different feature cohorts through the models, adding more features to every cohort. The cohort with all the features proved to have the best performance.

Table 2 - Cohorts’ results

What I found is that some models are consistently better than others in predicting likelihood to book. XGBoost consistently outperforms the other models. Also, the more variables I added, the more accurate the models' predictions became. All results can be found in appendix 2, table 6.

Please refer to Table 3 for the algorithm’s details.

Table 3 - Algorithm details

In terms of variable importance for predictive power, hotel cluster, hotel market, searched destination, whether or not a package was included, search duration and origin-destination distance proved to be the most relevant. Please refer to appendix 2, figure 9.

It was possible to identify users and bin them into those most likely, moderately likely and least likely to book. Refer to appendix 2, figure 10. This allows segmentation of new customers, which can lead to different strategies for each segment, enabling more precise targeting and better performance of campaigns and/or promotions.
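
The cut-off values used for the three bins are not given in the text. As a minimal sketch, assuming fixed thresholds on the predicted booking probability (0.33 and 0.66 are placeholders, as are the user names and scores), the segmentation could be expressed as:

```python
def segment_user(probability, low_cutoff=0.33, high_cutoff=0.66):
    """Map a predicted booking probability to one of three segments.

    The cut-offs are illustrative placeholders, not the study's values.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_cutoff:
        return "most likely"
    if probability >= low_cutoff:
        return "medium"
    return "least likely"


# Made-up model outputs for three hypothetical users.
predicted = {"user_a": 0.91, "user_b": 0.48, "user_c": 0.07}
segments = {user: segment_user(p) for user, p in predicted.items()}
```

In practice the thresholds would be chosen from the distribution of predicted probabilities, for instance so that each segment is large enough to justify its own campaign.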

As Table 4 shows, the model was general enough to obtain good predictions. The ROC AUC was 0.7833.

Table 4 - Quality metrics

4.3. PRESCRIPTIVE ANALYSIS

A next step would be to implement an automation workflow using integrations between Dataiku, the existing database and the Content Management System (CMS), as illustrated in Appendix 3, figure 11.

These integrations are possible through Zapier, a web-based service that enables seamless automation workflows between different applications. It can be described as a translator between web APIs.

It is possible to apply the model developed with past data and feed it with new data to predict the likelihood of a customer booking.

This ultimately leads to real-time website updates showing the targeted promotions or information that lead the consumer to buy, and ultimately to better marketing strategies.

5. CONCLUSIONS

This project has outlined the process of building a model to predict hotel bookings. The analyzed data included search patterns and bookings from Expedia’s dataset during the period 2013-2014.

I found that we can predict whether or not someone will book based on search patterns, and that variables such as the number of children or the location where the booking is being made have an impact on the booking window.

I was able to identify the users that are most likely and least likely to book, creating three different segments. These segments are very powerful for generating effective strategies that reach the right users with the right nudges, leading to more bookings and hence more revenue.

This ultimately allows marketers and professionals in hospitality, or in any activity related to hospitality, to steer their marketing efforts in more effective ways, creating the right promotions or marketing ‘nudges’ at the most relevant time in the search and booking funnel.

Feature selection is often extremely important for reaching good results, as including a full set of features can create too much noise and make it difficult to find the underlying patterns that improve the model. This was not the case for this project, where including the whole set of variables produced the best model outcomes.

XGBoost proved to be the most effective predictive model, with the highest ROC AUC. Hence it is the model most suited for this analysis and for training future feature cohorts on search and booking data.

The objectives of this study were different from the objectives established within the Kaggle competition for this dataset. This study’s target was to predict the likelihood of booking, whilst the competition was created to determine the likelihood of a user staying at 1 of the 100 hotel groups.

This dataset was selected due to having the relevant attributes for this study’s purpose. Kaggle’s winning models are very specific and tailored for the competition, and when presented with other sets of data they do not perform as well.

Hotels and Hospitality professionals from multiple industries have complete access to this data. The computational power available today and the various free tooling and datasets available provide all the necessary instruments to conduct meaningful analysis that can push hospitality-related companies to the forefront. It all comes down to proper time allocation and one’s willingness to test and play with the available resources.

6. LIMITATIONS AND RECOMMENDATIONS FOR FUTURE WORKS

Future works could consider combining external variables and extreme events with the data at hand and finding correlations among them, which could lead to better predictive power.

Automation of the model’s predictive outcomes should also be considered and further explained in a future project, allowing for data-oriented marketing strategies that could outperform traditional marketing efforts.

Because the data was anonymised, I could not see the actual continents and countries where the bookings were taking place. Prices were also unavailable, meaning that the data had predictive power but lacked the interpretability that would enable acting upon the results.

In future works, a more complete dataset with concrete values would enable much better predictions and actionable outcomes, such as being able to see the effect of price on booking patterns and how small variations in price can influence the predictive power of the models.

It is also reasonable to consider that weather conditions and a location’s economic, political and/or environmental stability play an important role in determining the propensity to book a certain property in a certain location, leading to changes in tourism demand. These are factors that change dynamically and continuously, hence a major challenge would be providing an acceptable measurement for these factors in order to include them in future predictive modelling.

Future studies should also consider implementing psychographic traits and personas to provide better promotions to the consumers most likely to buy, offering different products and different packages depending on the users’ level of extraversion or consciousness, two traits that proved to be accurate predictors of users’ propensity to buy certain products (Rauschnabel, Brem, & Ivens, 2015).

Using recommender systems can also be highly valuable for accurately predicting users’ purchasing behaviour, hence they should be included in future prediction studies.

7. BIBLIOGRAPHY

Bangwayo-Skeete, P. F., & Skeete, R. W. (2015). Can Google data improve the forecasting performance of tourist arrivals? Mixed-data sampling approach. Tourism Management, 46, 454–464. https://doi.org/10.1016/J.TOURMAN.2014.07.014

Bo, X., & Shi-Ting, L. (2014). 2014 7th International Conference on Intelligent Computation Technology and Automation (ICICTA). https://doi.org/10.1109/ICICTA.2014.91

Chan, C. K., Witt, S. F., Lee, Y. C. E., & Song, H. (2009). Tourism forecast combination using the CUSUM technique. Tourism Management, 31(6), 891–897. https://doi.org/10.1016/j.tourman.2009.10.004

Chen, T., & Guestrin, C. (n.d.). XGBoost: A Scalable Tree Boosting System. Retrieved from https://github.com/dmlc/xgboost

Claveria, O., Monte, E., & Torra, S. (2015). Common trends in international tourism demand: Are they useful to improve tourism predictions? Tourism Management Perspectives, 16, 116–122. https://doi.org/10.1016/J.TMP.2015.07.013

Claveria, O., & Torra, S. (2014). Forecasting tourism demand to Catalonia: Neural networks vs. time series models. Economic Modelling, 36, 220–228. https://doi.org/10.1016/J.ECONMOD.2013.09.024

Constantino, H., Fernandes, P. O., & Teixeira, J. P. (2016). Modelação da Procura Turística para Moçambique. IV Congresso Internacional de Turismo da ESG/IPCA: Tourism for the 21st Century, (July).

du Preez, J., & Witt, S. F. (2003). Univariate versus multivariate time series forecasting: an application to international tourism demand. International Journal of Forecasting, 19(3), 435–451. https://doi.org/10.1016/S0169-2070(02)00057-2

Ferreira, A. (2017). Expresso | Turismo. Portugal é o 14.o mais competitivo do mundo. Retrieved June 13, 2017, from http://expresso.sapo.pt/economia/2017-04-06-Turismo.-Portugal-e-o-14.-mais-competitivo-do-mundo

Haldar, M., Abdool, M., Ramanathan, P., Xu, T., Yang, S., Duan, H., … Legrand, D. (n.d.). Applying Deep Learning To Airbnb Search. Retrieved from https://arxiv.org/pdf/1810.09591.pdf

Hudson, S., & Thal, K. (2013). The Impact of Social Media on the Consumer Decision Process: Implications for Tourism Marketing. Journal of Travel & Tourism Marketing. https://doi.org/10.1080/10548408.2013.751276

INE. (2017). Portal do Instituto Nacional de Estatística. Retrieved June 13, 2017, from https://www.ine.pt/xportal/xmain?xpgid=ine_main&xpid=INE

Joana Nunes Mateus. (2017). Um Quarto dos Portugueses Vai Trabalhar Para o Turismo | Expresso Emprego. Retrieved July 4, 2017, from http://portugalarecrutar.expressoemprego.pt/noticias/um-quarto-dos-portugueses-vai-trabalhar-para-o-turismo/4353

Julie Cave. (2016). Digital Marketing Vs. Traditional Marketing: Which One Is Better? - Digital Doughnut. Retrieved July 10, 2017, from https://www.digitaldoughnut.com/articles/2016/july/digital-marketing-vs-traditional-marketing

Kaggle. (2015). Expedia Hotel Recommendations. Retrieved from https://www.kaggle.com/c/expedia-hotel-recommendations

Kang, H. (2013). The prevention and handling of the missing data. Korean Journal of Anesthesiology, 64(5), 402–6. https://doi.org/10.4097/kjae.2013.64.5.402

Kevin May. (2017). Google can rejoice: Priceline Group spent $3.5 billion on PPC in 2016. Retrieved June 1, 2017, from https://www.tnooz.com/article/priceline-group-3-5-billion-advertising-2016/

Laptev, N., Smyl, S., & Shanmugam, S. (2017). Engineering Extreme Event Forecasting at Uber with Recurrent Neural Networks - Uber Engineering Blog. Retrieved June 18, 2017, from https://eng.uber.com/neural-networks/

Law, R., & Au, N. (1999). A neural network model to forecast Japanese demand for travel to Hong Kong. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.4253&rep=rep1&type=pdf

Misrahi, T., & Crotti, R. (2018). The Travel & Tourism Competitiveness Report 2017: Paving the way for a more sustainable and inclusive future.

Mukherjee, S., & Lakshmanan, L. (2017). Google Cloud provides a unified, streamlined way to execute your ML strategy | Google Cloud Big Data and Machine Learning Blog | Google Cloud Platform. Retrieved February 2, 2018, from https://cloud.google.com/blog/big-data/2017/11/google-cloud-provides-a-unified-streamlined-way-to-execute-your-ml-strategy

Niklas Donges. (2018). The Random Forest Algorithm. Retrieved January 3, 2019, from https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd

Pai, P.-F., Hung, K.-C., & Lin, K.-P. (2014). Tourism demand forecasting using novel hybrid system. Expert Systems with Applications, 41(8), 3691–3702. https://doi.org/10.1016/J.ESWA.2013.12.007

Patrick Whyte. (2017). Smart Cities Need Open Data and a Willingness to Test and Learn. Retrieved June 20, 2017, from https://skift.com/2017/06/15/smart-cities-need-open-data-and-a-willingness-to-test-and-learn/?utm_campaign=SkiftWeeklyReviewNewsletter&utm_source=hs_email&utm_medium=email&utm_content=53244701&_hsenc=p2ANqtz--3Fu3wJdXe_S4rGD8KhVQjb-vWPMvXMODGF9d-

Ramya Bhaskar Sundaram. (2018). Understanding the Math behind the XGBoost Algorithm. Retrieved January 6, 2019, from https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/

Rauschnabel, P. A., Brem, A., & Ivens, B. S. (2015). Who will buy smart glasses? Empirical results of two pre-market-entry studies on the role of personality in individual awareness and intended adoption of Google Glass wearables. Computers in Human Behavior, 49(May), 635–647. https://doi.org/10.1016/j.chb.2015.03.003

Sunil Ray. (2017). Essentials of Machine Learning Algorithms (with Python and R Codes). Retrieved January 6, 2019, from https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/

Torres, E. N., Singh, D., & Robertson-Ring, A. (2015). Consumer reviews and the creation of booking transaction value: Lessons from the hotel industry. International Journal of Hospitality Management. https://doi.org/10.1016/j.ijhm.2015.07.012

Trefis Team. (2016). Here Are The Key Growth Areas For Priceline’s Booking.com - Nasdaq.com. Retrieved July 12, 2017, from http://www.nasdaq.com/g00/article/here-are-the-key-growth-areas-for-pricelines-bookingcom-cm723530?i10c.referrer=https%3A%2F%2Fwww.google.pt%2F

Wu, L., Mattila, A. S., & Hanks, L. (2015). Investigating the impact of surprise rewards on consumer responses. International Journal of Hospitality Management, 50, 27–35. https://doi.org/10.1016/j.ijhm.2015.07.004

Zhang, L. (2016). Price trends for cloud computing services. Wellesley College. Retrieved from http://repository.wellesley.edu/thesiscollection/386

8. APPENDIX

8.1. APPENDIX 1 – DESCRIPTIVE ANALYSIS

Figure 3 - Search window per week per continent

Figure 4 - Search window when booking with package or without package

Figure 5 - Search window if booking includes weekend or does not include weekend

Figure 6 - Search window when booking with children

Figure 7 - Number of bookings including children

Figure 8 - Number of bookings per continent

8.2. APPENDIX 2 – PREDICTIVE ANALYSIS

Table 5 - Variables Explanation

Table 6 - Models' results

Figure 9 - Variables importance

Figure 10 - Booking likelihood

8.3. APPENDIX 3 - AUTOMATION

Figure 11 - Automation workflow
