The Automated Travel Agent: Machine Learning for Hotel Cluster Recommendation
Michael Arruza Cruz, Michael Straka, John Pericich
Abstract

Expedia users who prefer the same types of hotels presumably share other, non-hotel commonalities with each other. With this in mind, Kaggle challenged developers to recommend hotels to Expedia users. Armed with a training set containing data about 37 million Expedia users, we set out to do just that. Our machine-learning algorithms ranged from direct applications of material learned in class to multi-part algorithms with novel combinations of recommender-system techniques. Kaggle's benchmark for randomly guessing a user's hotel cluster is 0.02260, and the mean average precision at K = 5 for naïve recommender systems is 0.05949. Our best combination of machine-learning algorithms achieved a score just over 0.30. Our results provide insight into performing multi-class classification on data sets that lack linear structure.
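The scores above use mean average precision at K = 5 (MAP@5), the competition's metric. A minimal sketch of that metric, assuming each user has exactly one correct cluster:

```python
def map_at_k(true_clusters, predictions, k=5):
    """Mean average precision at k when each user has one correct cluster.

    true_clusters: the correct cluster id per user.
    predictions: a ranked list of cluster ids per user.
    A user contributes 1/(rank of the correct cluster), or 0 if the
    correct cluster is absent from the top k.
    """
    total = 0.0
    for truth, ranked in zip(true_clusters, predictions):
        for rank, cluster in enumerate(ranked[:k], start=1):
            if cluster == truth:
                total += 1.0 / rank
                break
    return total / len(true_clusters)

# A correct first guess scores 1.0; a hit at rank 5 scores 0.2.
print(map_at_k([7, 3], [[7, 1, 2, 4, 5], [9, 8, 6, 0, 3]]))  # (1.0 + 0.2) / 2 = 0.6
```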
Data and Features

• 37 million data entries corresponding to user data, each with a total of 23 features covering hotel destination, number of rooms, number of children, length of stay, etc.
• Hotel clusters are anonymized, and only user data is given. This makes typical user-item matrix methods such as Alternating Least Squares impossible to use.
• The data set is skewed toward certain hotel clusters: some clusters are overrepresented while others appear very rarely.
• Likewise, some destinations appear very frequently, while others appear only a handful of times.
• The data is also not linearly separable.
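The cluster skew noted above is easy to quantify with a frequency count; the labels below are synthetic stand-ins for the real hotel_cluster column:

```python
from collections import Counter

# Synthetic labels standing in for the hotel_cluster column.
labels = [0] * 500 + [1] * 300 + [2] * 150 + [3] * 40 + [4] * 10

counts = Counter(labels)
total = sum(counts.values())
# Share of all examples claimed by the two most common clusters.
top2_share = sum(c for _, c in counts.most_common(2)) / total
print(top2_share)  # 0.8
```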
Methodology

Baseline methods:
• Our first attempt implemented a basic multinomial Naïve Bayes classifier that returns a list of the ten most likely clusters for a user. This method served as our baseline moving forward.
• Our second attempt used a support vector machine with an RBF kernel. This underperformed compared to Naïve Bayes. We suspect it is difficult to fit a hyperplane to the data without using parameters that result in overfitting, given the lack of linear separability.
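The Naïve Bayes baseline can be sketched with scikit-learn; the features and cluster counts below are synthetic, and the top-ten ranking mirrors the baseline described above:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_train = rng.integers(0, 10, size=(200, 5))  # stand-in for count-like user features
y_train = rng.integers(0, 20, size=200)       # stand-in for hotel cluster labels

clf = MultinomialNB().fit(X_train, y_train)

# Rank clusters by posterior probability and keep the ten most likely.
probs = clf.predict_proba(X_train[:1])[0]
top10 = clf.classes_[np.argsort(probs)[::-1][:10]]
print(len(top10))  # 10
```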
Gradient Boosting:
• Our first real success came from an ensemble of decision trees minimizing a softmax loss function.
• This is likely due to intelligent learning of the non-linear structure in the data, along with boosting's resistance to overfitting.
• It converged more slowly than the SVM and Naïve Bayes for increasing values of K in MAP@K, implying its rankings are more nuanced.
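A minimal sketch of this ensemble using scikit-learn's gradient boosting, which minimizes multinomial (softmax) deviance for multi-class targets; the data and hyperparameters here are illustrative, not the ones we tuned:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 6, size=300)  # stand-in for hotel cluster labels

# Shallow trees plus a small learning rate: boosting's usual guard
# against overfitting noisy, non-linear data.
gb = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                learning_rate=0.1).fit(X, y)

# Top-5 ranked clusters for one user, analogous to MAP@5 scoring.
ranked = np.argsort(gb.predict_proba(X[:1])[0])[::-1][:5]
print(len(ranked))  # 5
```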
Kernelized User Similarity (Most Effective Method):
• First, we cluster the data by destination: training examples sharing the same destination id are grouped together.
• For each new testing example, we retrieve the training group with a matching destination id and create user similarity matrices. The matrices are built with a kernel function; we tried the following three kernels:
K₁(x, y) = Σ_{i=1..m} 1{x_i = y_i}, with each score weighted by exp(−z² / (2τ²)),

where z is the placement of the user in terms of similarity to the test example (first most similar, second, third, etc.) and τ = 60.
K₂(x, y) = ( (1/m) Σ_{i=1..m} 1{x_i = y_i} )^e,

where e = 5.
K₃(x, y) = (1/m′) Σ_{i=1..m′} 1{x_i = y_i} + k Σ_{i=1..m″} x_i y_i,

where x and y are divided into two vectors of sizes m′ and m″, with different features separated into each.
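Reconstructed from the formulas above, a sketch of the three kernels; τ and e are as stated in the text, while the categorical/numeric feature split for the third kernel and the value of its constant k are assumptions:

```python
import math

def kernel1(x, y):
    """K1: count of matching features."""
    return sum(xi == yi for xi, yi in zip(x, y))

def rank_weight(z, tau=60.0):
    """Gaussian decay applied to K1 scores by similarity rank z."""
    return math.exp(-z ** 2 / (2 * tau ** 2))

def kernel2(x, y, e=5):
    """K2: fraction of matching features, raised to the power e."""
    return (sum(xi == yi for xi, yi in zip(x, y)) / len(x)) ** e

def kernel3(x_cat, y_cat, x_num, y_num, k=1.0):
    """K3: match rate on the categorical split plus a scaled dot
    product on the numeric split (the split itself is illustrative)."""
    match = sum(xi == yi for xi, yi in zip(x_cat, y_cat)) / len(x_cat)
    dot = sum(xi * yi for xi, yi in zip(x_num, y_num))
    return match + k * dot

print(kernel2([1, 2, 3, 4], [1, 2, 3, 0]))  # (3/4)**5 = 0.2373046875
```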
• Once the similarity matrix is created, find the 150 users most similar to the test example, and for each hotel cluster represented among those 150 users, sum their similarity scores (determined by the chosen kernel).
• Then recommend the top 10 hotel clusters (by summed similarity score) found among the most similar users.
• Of the three kernels above, the second proved most effective, likely due to the heavily discretized nature of the user features.
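The steps above can be sketched end to end; the users, clusters, and kernel choice below are synthetic and illustrative, with the second kernel standing in for the chosen similarity:

```python
from collections import defaultdict

def kernel2(x, y, e=5):
    """Fraction of matching features, raised to the power e."""
    return (sum(a == b for a, b in zip(x, y)) / len(x)) ** e

def recommend(test_user, train_users, kernel, n_neighbors=150, n_rec=10):
    """train_users: (features, hotel_cluster) pairs sharing the test
    user's destination id. Returns clusters ranked by the summed
    similarity of the neighbors that booked them."""
    scored = sorted(((kernel(test_user, feats), cluster)
                     for feats, cluster in train_users), reverse=True)
    votes = defaultdict(float)
    for sim, cluster in scored[:n_neighbors]:
        votes[cluster] += sim
    return sorted(votes, key=votes.get, reverse=True)[:n_rec]

# Two near-identical users booked cluster 40; one dissimilar user booked 7.
train = [([1, 2, 3], 40), ([1, 2, 0], 40), ([9, 9, 9], 7)]
print(recommend([1, 2, 3], train, kernel2))  # [40, 7]
```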
Results

Algorithm                   Precision   Recall    F1
SVM (RBF kernel)            0.0054      0.0101    0.0010
Naïve Bayes                 0.0759      0.0725    0.0735
Gradient Boosting           0.1239      0.1284    0.1087
User Similarity: Kernel 1   0.1825      0.1822    0.1717
User Similarity: Kernel 3   0.1836      0.1826    0.1716
User Similarity: Kernel 2   0.1856      0.1846    0.1734
Discussion and Future Work

This project provides an excellent case study for applying machine learning algorithms to large data sets lacking obvious structure. It also embodies the challenge of recommending items about which we have no features. In addressing these challenges, we demonstrated that a creative combination of user similarity matrices and Jaccard-style similarity outperforms gradient boosting, a technique currently well known for winning Kaggle competitions. For future work, we recommend using ensemble stacking methods to combine predictions from various algorithms. Further work could also explore tuning hyperparameters for gradient boosting.
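The stacking suggestion can be sketched with scikit-learn's StackingClassifier, here combining two of the model families compared above; the data and meta-learner choice are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.integers(0, 10, size=(200, 5)).astype(float)  # non-negative counts
y = rng.integers(0, 3, size=200)                      # stand-in cluster labels

# The meta-learner combines the base models' out-of-fold predictions.
stack = StackingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("gb", GradientBoostingClassifier(n_estimators=20))],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X, y)

print(stack.predict(X[:3]).shape)  # (3,)
```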
Figure 1: Frequency of hotel clusters in the data set.
Figure 2: Three-dimensional PCA of the data for the three most popular hotel clusters. While not linearly separable, there does appear to be some non-linear structure to the hotel clusters; the success of our user-similarity methods supports this.
Overall, the best methods were those that used user similarity and kernels to recommend the hotel clusters that other, similar users booked. Gradient Boosting was also effective, but its mean average precision seemed to hit a hard cap at 0.25 regardless of the parameters used. The SVM performed very poorly, as did the other basic machine learning methods we initially tried on the data.
Figure 3: Mean average precision (5 predictions) by algorithm: SVM (RBF kernel), Naïve Bayes, Gradient Boosting, and User Similarity with Kernels 1, 2, and 3 (y-axis from 0 to 0.35).