Deep Learning for Recommender Systems Alexandros Karatzoglou (Scientific Director @ Telefonica Research) [email protected], @alexk_z Balázs Hidasi (Head of Research @ Gravity R&D) [email protected], @balazshidasi RecSys’17, 29 August 2017, Como


Page 1: Deep Learning for Recommender Systems RecSys2017 Tutorial


Page 2: Deep Learning for Recommender Systems RecSys2017 Tutorial

Why Deep Learning?

ImageNet challenge error rates (red line = human performance)

Page 3: Deep Learning for Recommender Systems RecSys2017 Tutorial

Why Deep Learning?

Page 4: Deep Learning for Recommender Systems RecSys2017 Tutorial

Complex Architectures

Page 5: Deep Learning for Recommender Systems RecSys2017 Tutorial

Neural Networks are Universal Function Approximators

Page 6: Deep Learning for Recommender Systems RecSys2017 Tutorial

Inspiration for Neural Learning

Page 7: Deep Learning for Recommender Systems RecSys2017 Tutorial

Neural Model

Page 8: Deep Learning for Recommender Systems RecSys2017 Tutorial

Neuron a.k.a. Unit

Page 9: Deep Learning for Recommender Systems RecSys2017 Tutorial

Feedforward Multilayered Network

Page 10: Deep Learning for Recommender Systems RecSys2017 Tutorial

Learning

Page 11: Deep Learning for Recommender Systems RecSys2017 Tutorial

Stochastic Gradient Descent

• Generalization of (Stochastic) Gradient Descent
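The (stochastic) gradient descent update can be sketched in a few lines; the toy regression task, learning rate, and iteration count below are illustrative choices, not from the tutorial:

```python
import numpy as np

# Minimal SGD sketch: fit y = 2x with a squared loss, one sample per update.
rng = np.random.default_rng(0)
w = 0.0
lr = 0.1
for _ in range(200):
    x = rng.uniform(-1, 1)
    y = 2.0 * x
    grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
    w -= lr * grad
```

Each update uses the gradient of the loss on a single random example, which is the "stochastic" part; batch gradient descent would average over the full dataset instead.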

Page 12: Deep Learning for Recommender Systems RecSys2017 Tutorial

Stochastic Gradient Descent

Page 13: Deep Learning for Recommender Systems RecSys2017 Tutorial

Backpropagation

Page 14: Deep Learning for Recommender Systems RecSys2017 Tutorial

Backpropagation

• Does not work well in a plain "normal" multilayer deep network

• Vanishing gradients
• Slow learning

• SVMs were easier to train
• 2nd neural winter

Page 15: Deep Learning for Recommender Systems RecSys2017 Tutorial

Modern Deep Networks

• Ingredients:

• Rectified Linear activation function, a.k.a. ReLU

Page 16: Deep Learning for Recommender Systems RecSys2017 Tutorial

Modern Deep Networks
• Ingredients:

• Dropout

Page 17: Deep Learning for Recommender Systems RecSys2017 Tutorial

Modern Deep Networks
• Ingredients:

• Mini-batches:
– Stochastic Gradient Descent
– Compute the gradient over many (50-100) data points (a mini-batch) and update.

Page 18: Deep Learning for Recommender Systems RecSys2017 Tutorial

Modern Feedforward Networks
• Ingredients:
• Adagrad, a.k.a. adaptive learning rates
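The ingredients listed on these slides (ReLU, dropout, Adagrad-style adaptive learning rates) can be sketched as plain functions; all sizes, rates, and the inverted-dropout scaling choice below are illustrative:

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: max(0, x) elementwise
    return np.maximum(0.0, x)

def dropout(x, p, rng, train=True):
    # Inverted dropout: drop units with probability p during training,
    # scale the survivors so the expected activation is unchanged.
    if not train:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def adagrad_update(w, grad, acc, lr=0.1, eps=1e-8):
    # Adagrad: per-parameter learning rate shrinks with the accumulated
    # squared gradients.
    acc += grad ** 2
    return w - lr * grad / (np.sqrt(acc) + eps), acc

rng = np.random.default_rng(0)
h = dropout(relu(np.array([-1.0, 0.5, 2.0])), p=0.5, rng=rng)
```

At inference time dropout is a no-op (`train=False`); only training applies the random mask.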

Page 19: Deep Learning for Recommender Systems RecSys2017 Tutorial

• Feature extraction directly from the content
• Image, text, audio, etc.
• Instead of metadata
• For hybrid algorithms

• Heterogeneous data handled easily
• Dynamic/sequential behaviour modeling with RNNs
• More accurate representation learning of users and items

• Natural extension of CF & more
• RecSys is a complex domain

• Deep learning worked well in other complex domains
• Worth a try

Deep Learning for RecSys

Page 20: Deep Learning for Recommender Systems RecSys2017 Tutorial

• As of summer 2017, main topics:
• Learning item embeddings
• Deep collaborative filtering
• Feature extraction directly from content
• Session-based recommendations with RNNs

• And their combinations

Research directions in DL-RecSys

Page 21: Deep Learning for Recommender Systems RecSys2017 Tutorial

• Start simple
• Add improvements later

• Optimize code
• GPU/CPU optimizations may differ

• Scalability is key
• Open source your code
• Experiment (also) on public datasets
• Don't use very small datasets
• Don't work on irrelevant tasks, e.g. rating prediction

Best practices

Page 22: Deep Learning for Recommender Systems RecSys2017 Tutorial

Item embeddings & 2vec models

Page 23: Deep Learning for Recommender Systems RecSys2017 Tutorial

Embeddings

• Embedding: a (learned) real-valued vector representing an entity
– Also known as:
• Latent feature vector
• (Latent) representation
– Similar entities' embeddings are similar

• Use in recommenders:
– Initialization of the item representation in more advanced algorithms
– Item-to-item recommendations

Page 24: Deep Learning for Recommender Systems RecSys2017 Tutorial

Matrix factorization as learning embeddings

• MF: user & item embedding learning
– Similar feature vectors:
• Two items are similar
• Two users are similar
• User prefers item

– MF representation as a simplistic neural network
• Input: one-hot encoded user ID
• Input-to-hidden weights: user feature matrix
• Hidden layer: user feature vector
• Hidden-to-output weights: item feature matrix
• Output: preference (of the user) over the items

[Figure: MF as a neural network - the one-hot encoded user ID u (0,0,...,0,1,0,...,0) enters through the user feature matrix W_U; the hidden layer is the user feature vector; the item feature matrix W_I produces the output scores r_{u,1}, r_{u,2}, ..., r_{u,|I|}]
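The MF-as-network view above can be written out directly; the dimensions and random weights below are illustrative, not a trained model:

```python
import numpy as np

# MF viewed as a one-hidden-layer linear network (sketch; sizes made up).
n_users, n_items, k = 4, 6, 3
rng = np.random.default_rng(0)
W_user = rng.normal(size=(n_users, k))   # input-to-hidden: user feature matrix
W_item = rng.normal(size=(k, n_items))   # hidden-to-output: item feature matrix

u = 2
one_hot = np.zeros(n_users)
one_hot[u] = 1.0
user_vec = one_hot @ W_user              # hidden layer = user feature vector
scores = user_vec @ W_item               # output = preferences over all items
```

The one-hot input simply selects row u of W_user, which is why MF training is usually implemented as an embedding lookup rather than a matrix product.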

Page 25: Deep Learning for Recommender Systems RecSys2017 Tutorial

Word2Vec
• [Mikolov et al., 2013a]
• Representation learning of words
• Shallow model
• Data: (target) word + context pairs
– Sliding window on the document
– Context = words near the target
• In the sliding window
• 1-5 words in both directions

• Two models
– Continuous Bag of Words (CBOW)
– Skip-gram

Page 26: Deep Learning for Recommender Systems RecSys2017 Tutorial

Word2Vec - CBOW
• Continuous Bag of Words
• Maximizes the probability of the target word given the context
• Model
– Input: one-hot encoded words
– Input-to-hidden weights
• Embedding matrix of words
– Hidden layer
• Sum of the embeddings of the words in the context
– Hidden-to-output weights
– Softmax transformation
• Smooth approximation of the max operator
• Highlights the highest value
• s_i = e^{r_i} / Σ_j e^{r_j}   (r_j: scores)
– Output: likelihood of the words of the corpus given the context

• Embeddings are taken from the input-to-hidden matrix
– The hidden-to-output matrix also contains word representations (but these are not used)

[Figure: CBOW architecture - the one-hot encoded context words word(t-2), word(t-1), word(t+1), word(t+2) are embedded via E, averaged, then passed through W and a softmax producing p(w_i | c) over the vocabulary; the classifier predicts word(t)]

Page 27: Deep Learning for Recommender Systems RecSys2017 Tutorial

Word2Vec - Skip-gram

• Maximizes the probability of the context, given the target word
• Model
– Input: one-hot encoded word
– Input-to-hidden matrix: embeddings
– Hidden state
• Embedding of the target word
– Softmax transformation
– Output: likelihood of the context words (given the input word)
• Reported to be more accurate

[Figure: skip-gram architecture - the one-hot encoded word(t) is embedded via E; scores are passed through W and a softmax producing p(w_i | c), from which the classifier predicts the context words word(t-2), word(t-1), word(t+1), word(t+2)]
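A minimal forward pass of the skip-gram model above, with a hypothetical vocabulary size and embedding dimension and untrained random weights:

```python
import numpy as np

# Skip-gram forward pass sketch: embed the target word, score every word in
# the vocabulary, softmax into context-word likelihoods.
V, d = 10, 4                             # vocabulary size, embedding dim
rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, size=(V, d))      # input-to-hidden: word embeddings
W = rng.normal(0, 0.1, size=(d, V))      # hidden-to-output weights

target = 3
h = E[target]                            # equals one_hot(target) @ E
r = h @ W                                # scores over the vocabulary
s = np.exp(r) / np.exp(r).sum()          # softmax: s_i = e^{r_i} / sum_j e^{r_j}
```

Training would push up s at the indices of the observed context words; the learned embeddings are the rows of E.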

Page 28: Deep Learning for Recommender Systems RecSys2017 Tutorial

Geometry of the Embedding Space

King- Man+Woman=Queen

Page 29: Deep Learning for Recommender Systems RecSys2017 Tutorial

Paragraph2vec, doc2vec

• [Le & Mikolov, 2014]
• Learns representations of paragraphs/documents

• Based on the CBOW model
• The paragraph/document embedding is added to the model as global context

[Figure: paragraph2vec architecture - as in CBOW, the embedded context words are averaged, but a paragraph embedding p_i (looked up from the paragraph ID via P) is added as global context before the classifier predicts word(t)]

Page 30: Deep Learning for Recommender Systems RecSys2017 Tutorial

...2vec for Recommendations

Replace words with items in a session/user profile

[Figure: the CBOW architecture applied to items - the embedded items item(t-2), item(t-1), item(t+1), item(t+2) are averaged and the classifier predicts item(t)]

Page 31: Deep Learning for Recommender Systems RecSys2017 Tutorial

Prod2Vec
• [Grbovic et al., 2015]
• Skip-gram model on products
– Input: the i-th product purchased by the user
– Context: the other purchases of the user

• Bagged prod2vec model
– Input: products purchased in one basket by the user
• Basket: sum of the product embeddings
– Context: the other baskets of the user

• Learning user representations
– Follows paragraph2vec
– User embedding added as global context
– Input: user + products purchased, except for the i-th
– Target: the i-th product purchased by the user

• [Barkan & Koenigstein, 2016] proposed the same model later as item2vec
– Skip-gram with Negative Sampling (SGNS) applied to event data

Page 32: Deep Learning for Recommender Systems RecSys2017 Tutorial

Prod2Vec

[Grbovic et al., 2015]

prod2vec: skip-gram model on products

Page 33: Deep Learning for Recommender Systems RecSys2017 Tutorial

Bagged Prod2Vec

[Grbovic et al., 2015]

bagged-prod2vec model updates

Page 34: Deep Learning for Recommender Systems RecSys2017 Tutorial

User-Prod2Vec

[Grbovic et al., 2015]

User embeddings for user-to-product predictions

Page 35: Deep Learning for Recommender Systems RecSys2017 Tutorial

Utilizing more information
• Meta-Prod2vec [Vasile et al., 2016]
– Based on the prod2vec model
– Uses item metadata
• Embedded metadata
• Added to both the input and the context
– Losses between: target/context item/metadata
• The final loss is the combination of 5 of these losses

• Content2vec [Nedelec et al., 2017]
– Separate modules for multimodal information
• CF: Prod2vec
• Image: AlexNet (a type of CNN)
• Text: Word2Vec and TextCNN
– Learns pairwise similarities
• Likelihood of two items being bought together

[Figure: Meta-Prod2vec - the embedded item i_t and its embedded metadata m_t feed classifiers predicting the surrounding items item(t-1), item(t+1), item(t+2), the surrounding metadata meta(t-1), meta(t+1), and item(t) itself]

Page 36: Deep Learning for Recommender Systems RecSys2017 Tutorial

References
• [Barkan & Koenigstein, 2016] O. Barkan, N. Koenigstein: ITEM2VEC: Neural Item Embedding for Collaborative Filtering. IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP 2016).
• [Grbovic et al., 2015] M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, D. Sharp: E-commerce in Your Inbox: Product Recommendations at Scale. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'15).
• [Le & Mikolov, 2014] Q. Le, T. Mikolov: Distributed Representations of Sentences and Documents. 31st International Conference on Machine Learning (ICML 2014).
• [Mikolov et al., 2013a] T. Mikolov, K. Chen, G. Corrado, J. Dean: Efficient Estimation of Word Representations in Vector Space. ICLR 2013 Workshop.
• [Mikolov et al., 2013b] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean: Distributed Representations of Words and Phrases and Their Compositionality. 26th Advances in Neural Information Processing Systems (NIPS 2013).
• [Morin & Bengio, 2005] F. Morin, Y. Bengio: Hierarchical Probabilistic Neural Network Language Model. International Workshop on Artificial Intelligence and Statistics, 2005.
• [Nedelec et al., 2017] T. Nedelec, E. Smirnova, F. Vasile: Specializing Joint Representations for the Task of Product Recommendation. 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017).
• [Vasile et al., 2016] F. Vasile, E. Smirnova, A. Conneau: Meta-Prod2Vec - Product Embeddings Using Side-Information for Recommendations. 10th ACM Conference on Recommender Systems (RecSys'16).

Page 37: Deep Learning for Recommender Systems RecSys2017 Tutorial

Deep Collaborative Filtering

Page 38: Deep Learning for Recommender Systems RecSys2017 Tutorial

CF with Neural Networks
• Natural application area
• Some exploration during the Netflix prize
• E.g.: NSVD1 [Paterek, 2007]
– Asymmetric MF
– The model:
• Input: sparse vector of interactions
– Item-NSVD1: ratings given for the item by users
» Alternatively: metadata of the item
– User-NSVD1: ratings given by the user
• Input-to-hidden weights: "secondary" feature vectors
• Hidden layer: item/user feature vector
• Hidden-to-output weights: user/item feature vectors
• Output:
– Item-NSVD1: predicted ratings on the item by all users
– User-NSVD1: predicted ratings of the user on all items
– Training with SGD
– Implicit counterpart by [Pilászy & Tikk, 2009]
– No non-linearities in the model

[Figure: User-NSVD1 - the ratings of the user pass through the secondary feature vectors to form the user features, which are multiplied by the item feature vectors to produce the predicted ratings]

Page 39: Deep Learning for Recommender Systems RecSys2017 Tutorial

Restricted Boltzmann Machines (RBM) for recommendation

• RBM
– Generative stochastic neural network
– Visible & hidden units connected by (symmetric) weights
• Stochastic binary units
• Activation probabilities:
– p(h_j = 1 | v) = σ(b_j^h + Σ_{i=1}^N w_{i,j} v_i)
– p(v_i = 1 | h) = σ(b_i^v + Σ_{j=1}^M w_{i,j} h_j)
– Training
• Set the visible units based on the data
• Sample the hidden units
• Sample the visible units
• Modify the weights so that the configuration of the visible units approaches the data

• In recommenders [Salakhutdinov et al., 2007]
– Visible units: ratings on the movies
• Softmax units
– Vector of length 5 (one per rating value) in each unit
– Ratings are one-hot encoded
• Units corresponding to movies the user has not rated are ignored
– Hidden binary units

[Figure: RBM for ratings - visible units v_1 ... v_5 hold one-hot encoded ratings (e.g. a rating of 2 becomes 0,1,0,0,0), connected to binary hidden units h_1, h_2, h_3]
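The training loop sketched above corresponds to one contrastive-divergence (CD-1) step; the layer sizes, data vector, and learning rate below are illustrative:

```python
import numpy as np

# One CD-1 step for a binary RBM, following the activation probabilities
# p(h|v) and p(v|h) above. Sizes and learning rate are illustrative.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_vis, n_hid = 5, 3
W = rng.normal(0, 0.1, size=(n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

v0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0])      # visible units set from data
ph0 = sigmoid(b_h + v0 @ W)                   # p(h = 1 | v)
h0 = (rng.random(n_hid) < ph0).astype(float)  # sample hidden units
pv1 = sigmoid(b_v + W @ h0)                   # p(v = 1 | h)
v1 = (rng.random(n_vis) < pv1).astype(float)  # sample visible units
ph1 = sigmoid(b_h + v1 @ W)

lr = 0.1                                      # move weights toward the data
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
```

The positive term pulls the weights toward the data configuration, the negative term pushes them away from the model's own reconstruction.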

Page 40: Deep Learning for Recommender Systems RecSys2017 Tutorial

Deep Boltzmann Machines (DBM)
• Layer-wise training
– Train the weights between the visible and hidden units as an RBM
– Add a new layer of hidden units
– Train the weights connecting the new layer to the network
• All other weights (e.g. visible-hidden weights) are fixed

[Figure: layer-wise DBM training - first the visible-to-hidden weights are trained as an RBM; then a second hidden layer is added and only its connections are trained (lower weights fixed); then a third layer, and so on]

Page 41: Deep Learning for Recommender Systems RecSys2017 Tutorial

Autoencoders

• Autoencoder
– One hidden layer
– Same number of input and output units
– Tries to reconstruct the input on the output
– Hidden layer: compressed representation of the data

• Constraining the model improves generalization
– Sparse autoencoders
• Activations of units are limited
• Activation penalty
• Requires the whole training set to compute
– Denoising autoencoders [Vincent et al., 2008]
• Corrupt the input (e.g. set random values to zero)
• Restore the original on the output

• Deep version
– Stacked autoencoders
– Layer-wise training (historically)
– End-to-end training (more recently)

[Figure: denoising autoencoder - the data is corrupted on the input, passed through the hidden layer, and the reconstructed output is compared to the original data]
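The denoising autoencoder above, as a forward-pass sketch with random untrained weights (sizes and corruption rate are illustrative):

```python
import numpy as np

# Denoising-autoencoder sketch: corrupt the input by zeroing random entries,
# then reconstruct the ORIGINAL input from the corrupted one.
rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3
W_enc = rng.normal(0, 0.1, size=(n_in, n_hidden))
W_dec = rng.normal(0, 0.1, size=(n_hidden, n_in))

x = rng.random(n_in)
corrupted = x * (rng.random(n_in) >= 0.3)   # drop ~30% of the values
hidden = np.tanh(corrupted @ W_enc)         # compressed representation
reconstruction = hidden @ W_dec             # should approach the clean x

loss = ((reconstruction - x) ** 2).mean()   # train by minimizing this
```

The key detail is that the loss compares the reconstruction against the clean input, not the corrupted one, which forces the hidden layer to capture structure rather than copy values.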

Page 42: Deep Learning for Recommender Systems RecSys2017 Tutorial

Autoencoders for recommendation

• Reconstruct corrupted user interaction vectors
– CDL [Wang et al., 2015]: Collaborative Deep Learning
• Uses Bayesian stacked denoising autoencoders
• Uses tags/metadata instead of the item ID

Page 43: Deep Learning for Recommender Systems RecSys2017 Tutorial

Autoencoders for recommendation

• Reconstruct corrupted user interaction vectors
– CDAE [Wu et al., 2016]: Collaborative Denoising Auto-Encoder
• Additional user node on the input and bias node beside the hidden layer

Page 44: Deep Learning for Recommender Systems RecSys2017 Tutorial

Recurrent autoencoder

• CRAE [Wang et al., 2016]
– Collaborative Recurrent Autoencoder
– Encodes text (e.g. movie plot, review)
– Autoencoding with RNNs
• Encoder-decoder architecture
• The input is corrupted by replacing words with a designated BLANK token

– CDL model + text encoding simultaneously
• Joint learning

Page 45: Deep Learning for Recommender Systems RecSys2017 Tutorial

Deep CF methods

• MV-DNN [Elkahky et al., 2015]
– Multi-domain recommender
– Separate feedforward networks for users and for items per domain (D+1 networks)
• Features are first embedded
• Then run through several layers

Page 46: Deep Learning for Recommender Systems RecSys2017 Tutorial

Deep CF methods

• TDSSM [Song et al., 2016]
• Temporal Deep Semantic Structured Model
• Similar to MV-DNN
• User features are the combination of a static and a temporal part
• The time-dependent part is modeled by an RNN

Page 47: Deep Learning for Recommender Systems RecSys2017 Tutorial

Deep CF methods

• Coevolving features [Dai et al., 2016]
• Users' tastes and items' audiences change over time
• User/item features depend on time and are composed of:
• Time drift vector
• Self evolution
• Co-evolution with items/users
• Interaction vector
• Feature vectors are learned by RNNs

Page 48: Deep Learning for Recommender Systems RecSys2017 Tutorial

Deep CF methods

• Product Neural Network (PNN) [Qu et al., 2016]
– For CTR estimation
– Embeds features
– Pairwise layer: all pairwise combinations of the embedded features
• Like Factorization Machines
• Outer/inner product of the feature vectors, or both
– Several fully connected layers

• CF-NADE [Zheng et al., 2016]
– Neural Autoregressive Collaborative Filtering
– User events → preference (0/1) + confidence (based on occurrence)
– Reconstructs some of the user events based on the others (not the full set)
• Random ordering of the user events
• Reconstruct preference i based on the preferences and confidences up to i-1
– The loss is weighted by the confidences

Page 49: Deep Learning for Recommender Systems RecSys2017 Tutorial

Applications: app recommendations

• Wide & Deep Learning [Cheng et al., 2016]
• Ranking of results matching a query
• Combination of two models
– Deep neural network
• On embedded item features
• "Generalization"
– Linear model
• On embedded item features
• And cross products of item features
• "Memorization"
– Joint training
– Logistic loss

• Improved online performance
– +2.9% deep over wide
– +3.9% deep+wide over wide

Page 50: Deep Learning for Recommender Systems RecSys2017 Tutorial

Applications: video recommendations

• YouTube Recommender [Covington et al., 2016]
– Two networks
– Candidate generation
• Recommendations as classification
– Items clicked / not clicked when they were recommended
• Feedforward network on many features
– Average watch embedding vector of the user (last few items)
– Average search embedding vector of the user (last few searches)
– User attributes
– Geographic embedding
• Negative item sampling + softmax
– Reranking
• More features
– Actual video embedding
– Average video embedding of watched videos
– Language information
– Time since last watch
– Etc.
• Weighted logistic regression on top of the network

Page 51: Deep Learning for Recommender Systems RecSys2017 Tutorial

References
• [Cheng et al., 2016] H. T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah: Wide & Deep Learning for Recommender Systems. 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016).
• [Covington et al., 2016] P. Covington, J. Adams, E. Sargin: Deep Neural Networks for YouTube Recommendations. 10th ACM Conference on Recommender Systems (RecSys'16).
• [Dai et al., 2016] H. Dai, Y. Wang, R. Trivedi, L. Song: Recurrent Co-Evolutionary Latent Feature Processes for Continuous-time Recommendation. 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016).
• [Elkahky et al., 2015] A. M. Elkahky, Y. Song, X. He: A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems. 24th International Conference on World Wide Web (WWW'15).
• [Paterek, 2007] A. Paterek: Improving Regularized Singular Value Decomposition for Collaborative Filtering. KDD Cup and Workshop 2007.
• [Pilászy & Tikk, 2009] I. Pilászy, D. Tikk: Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata. 3rd ACM Conference on Recommender Systems (RecSys'09).
• [Qu et al., 2016] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu: Product-based Neural Networks for User Response Prediction. 16th International Conference on Data Mining (ICDM 2016).
• [Salakhutdinov et al., 2007] R. Salakhutdinov, A. Mnih, G. Hinton: Restricted Boltzmann Machines for Collaborative Filtering. 24th International Conference on Machine Learning (ICML 2007).
• [Song et al., 2016] Y. Song, A. M. Elkahky, X. He: Multi-Rate Deep Learning for Temporal Recommendation. 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'16).
• [Vincent et al., 2008] P. Vincent, H. Larochelle, Y. Bengio, P. A. Manzagol: Extracting and Composing Robust Features with Denoising Autoencoders. 25th International Conference on Machine Learning (ICML 2008).
• [Wang et al., 2015] H. Wang, N. Wang, D. Y. Yeung: Collaborative Deep Learning for Recommender Systems. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'15).
• [Wang et al., 2016] H. Wang, X. Shi, D. Y. Yeung: Collaborative Recurrent Autoencoder: Recommend while Learning to Fill in the Blanks. Advances in Neural Information Processing Systems (NIPS 2016).
• [Wu et al., 2016] Y. Wu, C. DuBois, A. X. Zheng, M. Ester: Collaborative Denoising Auto-encoders for Top-n Recommender Systems. 9th ACM International Conference on Web Search and Data Mining (WSDM'16).
• [Zheng et al., 2016] Y. Zheng, C. Liu, B. Tang, H. Zhou: Neural Autoregressive Collaborative Filtering for Implicit Feedback. 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016).

Page 52: Deep Learning for Recommender Systems RecSys2017 Tutorial

Feature Extraction from Content

Page 53: Deep Learning for Recommender Systems RecSys2017 Tutorial

Content features in recommenders
• Hybrid CF+CBF systems
– Interaction data + metadata

• Model-based hybrid solutions
– Initializing
• Obtain item representations based on metadata
• Use this representation as the initial item features
– Regularizing
• Obtain metadata-based representations
• The interaction-based representation should be close to the metadata-based one
• Add a regularizing term on this difference to the loss
– Joining
• Obtain metadata-based representations
• Have the item feature vector be a concatenation of:
– A fixed metadata-based part
– A learned interaction-based part

Page 54: Deep Learning for Recommender Systems RecSys2017 Tutorial

Feature extraction from content
• Deep learning is capable of direct feature extraction
– Work with the content directly
– Instead of (or besides) metadata

• Images
– E.g.: product pictures, video thumbnails/frames
– Extraction: convolutional networks
– Applications (e.g.): fashion, video

• Text
– E.g.: product description, content of the product, reviews
– Extraction:
• RNNs
• 1D convolutional networks
• Weighted word embeddings
• Paragraph vectors
– Applications (e.g.): news, books, publications

• Music/audio
– Extraction: convolutional networks (or RNNs)

Page 55: Deep Learning for Recommender Systems RecSys2017 Tutorial

Convolutional Neural Networks (CNN)

• Speciality of images
– Huge amount of information
• 3 channels (RGB)
• Lots of pixels
• Number of weights required to fully connect a 320x240 image to 2048 hidden units: 3*320*240*2048 = 471,859,200
– Locality
• Objects' presence is independent of location or orientation
• Objects are spatially restricted

Page 56: Deep Learning for Recommender Systems RecSys2017 Tutorial

Convolutional Neural Networks (CNN)
• Image input
– 3D tensor
• Width
• Height
• Channels (R, G, B)

• Text/sequence inputs
– Matrix of one-hot encoded entities

• Inputs must be of the same size
– Padding

• (Classic) convolutional nets
– Convolution layers
– Pooling layers
– Fully connected layers

Page 57: Deep Learning for Recommender Systems RecSys2017 Tutorial

Convolutional Neural Networks (CNN)
• Convolutional layer (2D)
– Filter
• Learnable weights, arranged in a small tensor (e.g. 3x3xD)
– The tensor's depth equals the depth of the input
• Recognizes certain patterns on the image
– Convolution with a filter
• Apply the filter on regions of the image
– y_{i,j} = f(Σ_{a,b,c} w_{a,b,c} I_{a+i-1, b+j-1, c})
» Filters are applied over all channels (the depth of the input tensor)
» The activation function is usually some kind of ReLU
– Start from the upper left corner
– Move right by one and apply again
– On reaching the end, go back and shift down by one
• Result: a 2D map of activations, high at places corresponding to the pattern recognized by the filter

– Convolution layer: multiple filters of the same size
• Input size: W1 x W2 x D
• Filter size: F x F x D
• Stride (shift value): S
• Number of filters: N
• Output size: ((W1-F)/S + 1) x ((W2-F)/S + 1) x N
• Number of weights: F x F x D x N
– Another way to look at it:
• Hidden neurons organized in a ((W1-F)/S + 1) x ((W2-F)/S + 1) x N tensor
• Weights are shared between neurons with the same depth
• A neuron processes an F x F x D region of the input
• Neighboring neurons process regions shifted by the stride value

Example (2D convolution, stride 1):

Input (4x4):      Filter (3x3):     Output (2x2):
1 3 8 0           -1 -2 -1          48 -27
0 7 2 1           -2 12 -2          19  28
2 5 5 1           -1 -2 -1
4 2 3 0
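The numeric example above can be checked in a few lines of stride-1 convolution with no padding, giving an output of size (4-3)/1 + 1 = 2 in each dimension:

```python
import numpy as np

# Reproduce the slide's 4x4-input / 3x3-filter convolution example.
I = np.array([[1, 3, 8, 0],
              [0, 7, 2, 1],
              [2, 5, 5, 1],
              [4, 2, 3, 0]])
F = np.array([[-1, -2, -1],
              [-2, 12, -2],
              [-1, -2, -1]])

out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # elementwise product of the filter with each 3x3 region, summed
        out[i, j] = (I[i:i+3, j:j+3] * F).sum()
# out == [[48, -27], [19, 28]]
```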

Page 58: Deep Learning for Recommender Systems RecSys2017 Tutorial

Convolutional Neural Networks (CNN)
• Pooling layer
– Mean pooling: replace an R x R region with the mean of its values
– Max pooling: replace an R x R region with the maximum of its values
– Used to quickly reduce the size
– Cheap, but a very aggressive operator
• Avoid when possible
• Often needed, because convolutions don't decrease the number of inputs fast enough
– Input size: W1 x W2 x N
– Output size: (W1/R) x (W2/R) x N

• Fully connected layers
– Final few layers
– Each hidden neuron is connected with every neuron in the next layer

• Residual connections (improvement) [He et al., 2016]
– Very deep networks degrade performance
– Hard to find the proper mappings
– Reformulation of the problem: F(x) → F(x) + x

[Figure: residual block - the input x bypasses two layers computing F(x) and is added to their output, giving F(x) + x]

Page 59: Deep Learning for Recommender Systems RecSys2017 Tutorial

ConvolutionalNeuralNetworks(CNN)

• Someexamples• GoogLeNet[Szegedyet.al,2015]

• Inception-v3model[Szegedyet.al,2016]

• ResNet(upto200+layers)[Heet.al,2016]

Page 60: Deep Learning for Recommender Systems RecSys2017 Tutorial

Images in recommenders
• [McAuley et al., 2015]
– Learns a parameterized distance metric over visual features
• Visual features are extracted from a pretrained CNN
• Distance function: Euclidean distance of "embedded" visual features
– Embedding here: multiplication with a weight matrix to reduce the number of dimensions

– Personalized distance
• Reweights the distance with a user-specific weight vector

– Training: maximizing the likelihood of an existing relationship with the target item
• Over uniformly sampled negative items

Page 61: Deep Learning for Recommender Systems RecSys2017 Tutorial

Images in recommenders
• Visual BPR [He & McAuley, 2016]
– Model composed of:
• Bias terms
• MF model
• Visual part
– Pretrained CNN features
– Dimension reduction through "embedding"
– The product of this visual item feature and a learned user feature vector is used in the model
• Visual bias
– Product of the pretrained CNN features and a global bias vector over its features
– BPR loss
– Tested on clothing datasets (9-25% improvement)

Page 62: Deep Learning for Recommender Systems RecSys2017 Tutorial

Music representations
• [Oord et al., 2013]
– Extends iALS/WMF with audio features
• To overcome cold-start
– Music feature extraction
• Time-frequency representation
• CNN applied on 3-second samples
• Latent factor of the clip: average of the predictions on consecutive windows of the clip
– Integration with MF
• (a) Minimize the distance between the music features and the MF's feature vectors
• (b) Replace the item features with the music features (minimize the original loss)

Page 63: Deep Learning for Recommender Systems RecSys2017 Tutorial

Textual information improving recommendations

• [Bansal et al., 2016]
– Paper recommendation
– Item representation
• Text representation
– Two-layer GRU (RNN): a bidirectional layer followed by a unidirectional layer
– The representation is created by pooling over the hidden states of the sequence
• ID-based representation (item feature vector)
• Final representation: ID + text added
– Multi-task learning
• Predict both user scores
• And the likelihood of tags
– End-to-end training
• All parameters are trained simultaneously (no pretraining)
• Loss
– User scores: weighted MSE (as in iALS)
– Tags: weighted log-likelihood (unobserved tags are downweighted)

Page 64: Deep Learning for Recommender Systems RecSys2017 Tutorial

References
• [Bansal et al., 2016] T. Bansal, D. Belanger, A. McCallum: Ask the GRU: Multi-Task Learning for Deep Text Recommendations. 10th ACM Conference on Recommender Systems (RecSys'16).
• [He et al., 2016] K. He, X. Zhang, S. Ren, J. Sun: Deep Residual Learning for Image Recognition. CVPR 2016.
• [He & McAuley, 2016] R. He, J. McAuley: VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. 30th AAAI Conference on Artificial Intelligence (AAAI'16).
• [McAuley et al., 2015] J. McAuley, C. Targett, Q. Shi, A. Hengel: Image-based Recommendations on Styles and Substitutes. 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'15).
• [Oord et al., 2013] A. Oord, S. Dieleman, B. Schrauwen: Deep Content-based Music Recommendation. Advances in Neural Information Processing Systems (NIPS 2013).
• [Szegedy et al., 2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich: Going Deeper with Convolutions. CVPR 2015.
• [Szegedy et al., 2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna: Rethinking the Inception Architecture for Computer Vision. CVPR 2016.

Page 65: Deep Learning for Recommender Systems RecSys2017 Tutorial

Session-based Recommendations with RNNs

Page 66: Deep Learning for Recommender Systems RecSys2017 Tutorial

Recurrent Neural Networks

• Input: sequential information (x_t, t = 1...T)

• Hidden state (h_t):
– representation of the sequence so far
– influenced by every element of the sequence up to t

• h_t = f(W x_t + U h_{t-1} + b)

Page 67: Deep Learning for Recommender Systems RecSys2017 Tutorial

RNN-based machine learning
• Sequence to value
– Encoding, labeling
– E.g.: time series classification

• Value to sequence
– Decoding, generation
– E.g.: sequence generation

• Sequence to sequence
– Simultaneous
• E.g.: next-click prediction
– Encoder-decoder architecture
• E.g.: machine translation
• Two RNNs (encoder & decoder)
– The encoder produces a vector describing the sequence:
» Last hidden state
» Combination of the hidden states (e.g. mean pooling)
» Learned combination of the hidden states
– The decoder receives the summary and generates a new sequence
» The generated symbol is usually fed back to the decoder
» The summary vector can be used to initialize the decoder
» Or can be given as global context
• Attention mechanism (optionally)

[Figure: four RNN patterns - sequence to value (hidden states h_1...h_3 over x_1...x_3 producing y), value to sequence (x producing y_1...y_3), simultaneous sequence to sequence, and an encoder-decoder whose summary vector s feeds a decoder generating y_1, y_2, ...]

Page 68: Deep Learning for Recommender Systems RecSys2017 Tutorial

Exploding/vanishing gradients

• h_t = f(W x_t + U h_{t-1} + b)
• Gradient of h_t w.r.t. x_1
– Simplification: linear activations
• In reality: bounded
– ∂h_t/∂x_1 = (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) ⋯ (∂h_2/∂h_1) (∂h_1/∂x_1) = U^{t-1} W

• ‖U‖₂ < 1 → vanishing gradients
– The effect of values further in the past is neglected
– The network forgets

• ‖U‖₂ > 1 → exploding gradients
– Gradients become very large on longer sequences
– The network becomes unstable
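The U^{t-1} factor can be demonstrated numerically; using an orthogonal U makes the spectral norm of (cU)^k exactly c^k, so the two regimes are easy to see:

```python
import numpy as np

# Orthogonal U (from QR) has spectral norm 1 and its powers stay orthogonal,
# so ||(c U)^k||_2 = c^k exactly. Scale below/above 1 to see both regimes.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))

vanish = np.linalg.norm(np.linalg.matrix_power(0.9 * U, 50), 2)   # 0.9^50
explode = np.linalg.norm(np.linalg.matrix_power(1.1 * U, 50), 2)  # 1.1^50
```

After only 50 steps the factor is roughly 0.005 in one regime and above 100 in the other, which is why plain RNNs struggle with long-range dependencies.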

Page 69: Deep Learning for Recommender Systems RecSys2017 Tutorial

Handling exploding gradients
• Gradient clipping
– If the gradient is larger than a threshold, scale it back to the threshold
– Updates are not accurate
– Vanishing gradients are not solved

• Enforce ‖U‖₂ = 1
– Unitary RNN
– Unable to forget

• Gated networks
– Long Short-Term Memory (LSTM)
– Gated Recurrent Unit (GRU)
– (and some other variants)
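Norm-based gradient clipping from the first bullet, as a sketch:

```python
import numpy as np

def clip_gradient(grad, threshold):
    # If the gradient's L2 norm exceeds the threshold, rescale it so its
    # norm equals the threshold; the direction is preserved.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

g = clip_gradient(np.array([30.0, 40.0]), threshold=5.0)  # norm 50 -> 5
```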

Page 70: Deep Learning for Recommender Systems RecSys2017 Tutorial

Long Short-Term Memory (LSTM)
• [Hochreiter & Schmidhuber, 1997]
• Instead of rewriting the hidden state during the update, add a delta
– s_t = s_{t-1} + Δs_t
– Keeps the contribution of earlier inputs relevant

• Information flow is controlled by gates
– Gates depend on the input and the hidden state
– Between 0 and 1
– Forget gate (f): 0/1 → reset/keep hidden state
– Input gate (i): 0/1 → don't/do consider the contribution of the input
– Output gate (o): how much of the memory is written to the hidden state

• The hidden state is separated into two (read before you write)
– Memory cell (c): internal state of the LSTM cell
– Hidden state (h): influences the gates, updated from the memory cell

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c̃_t = tanh(W x_t + U h_{t-1} + b)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
h_t = o_t ∘ tanh(c_t)

[Figure: LSTM cell - the input and the previous hidden state feed the input (i), forget (f), and output (o) gates around the memory cell C]

Page 71: Deep Learning for Recommender Systems RecSys2017 Tutorial

Gated Recurrent Unit (GRU)
• [Cho et al., 2014]
• Simplified information flow
– Single hidden state
– Input and forget gates merged → update gate (z)
– No output gate
– Reset gate (r) to break the information flow from the previous hidden state

• Similar performance to LSTM

z_t = σ(W_z x_t + U_z h_{t-1} + b_z)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)
h̃_t = tanh(W x_t + r_t ∘ (U h_{t-1}) + b)
h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ h̃_t

[Figure: GRU cell - the update gate z interpolates between the previous hidden state and the candidate state]
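One step of the GRU update equations above, as a sketch with illustrative sizes and random untrained weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    # Update gate z: how much of the previous hidden state to keep.
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])
    # Reset gate r: how much of the previous state feeds the candidate.
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])
    h_tilde = np.tanh(p["W"] @ x_t + r * (p["U"] @ h_prev) + p["b"])
    return z * h_prev + (1.0 - z) * h_tilde

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
p = {k: rng.normal(0, 0.1, size=(d_h, d_in)) for k in ("Wz", "Wr", "W")}
p.update({k: rng.normal(0, 0.1, size=(d_h, d_h)) for k in ("Uz", "Ur", "U")})
p.update({k: np.zeros(d_h) for k in ("bz", "br", "b")})

h = gru_step(rng.normal(size=d_in), np.zeros(d_h), p)
```

Because h_t is a convex combination of h_{t-1} and a tanh candidate, the state stays bounded in [-1, 1] componentwise.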

Page 72: Deep Learning for Recommender Systems RecSys2017 Tutorial

Session-based recommendations
• Sequence of events
– User identification problem
– Disjoint sessions (instead of a consistent user history)

• Tasks
– Next click prediction
– Predicting intent

• Classic algorithms can't cope with it well
– Item-to-item recommendations as an approximation in live systems

• Area revitalized by RNNs

Page 73: Deep Learning for Recommender Systems RecSys2017 Tutorial

GRU4Rec (1/3)
• [Hidasi et al., 2015]
• Network structure
– Input: one-hot encoded item ID
– Optional embedding layer
– GRU layer(s)
– Output: scores over all items
– Target: the next item in the session

• Adapting GRU to session-based recommendations
– Sessions of (very) different length & lots of short sessions: session-parallel mini-batching
– Lots of items (inputs, outputs): sampling on the output
– The goal is ranking: listwise loss functions on pointwise/pairwise scores

[Figure: GRU4Rec network - the one-hot encoded item ID feeds the GRU layer; the weighted output f() gives scores on the items, trained against the one-hot encoded next item ID]

Page 74: Deep Learning for Recommender Systems RecSys2017 Tutorial

GRU4Rec (2/3)
• Session-parallel mini-batches
– The mini-batch is defined over sessions
– Update with one-step BPTT
• Lots of sessions are very short
• 2D mini-batching and updating on longer sequences (with or without padding) didn't improve accuracy

• Output sampling
– Computing scores for all items (100K - 1M) in every step is slow
– One positive item (target) + several samples
– Fast solution: scores on the mini-batch targets
• The items of the other mini-batch examples are negative samples for the current example

• Loss functions
– Cross-entropy + softmax
– Average of BPR scores
– TOP1 score (average of ranking error + regularization over the score values)

[Figure: session-parallel mini-batching - the events of sessions 1-5 (i_{1,1} ... i_{5,3}) are arranged so that each mini-batch position follows one session; the input at each step is the current event, the desired output is the next event of the same session, and a one-hot target matrix marks the next items]

XE = -log(s_i),  s_i = e^{r_i} / Σ_{j=1}^{N_S} e^{r_j}

BPR = -(1/N_S) Σ_{j=1}^{N_S} log σ(r_i - r_j)

TOP1 = (1/N_S) ( Σ_{j=1}^{N_S} σ(r_j - r_i) + Σ_{j=1}^{N_S} σ(r_j²) )
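The BPR and TOP1 losses above, computed for one target score r_i against a few sampled negative scores r_j (the score values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(r_target, r_neg):
    # Pairwise loss: the target item should outscore each negative sample.
    return -np.mean(np.log(sigmoid(r_target - r_neg)))

def top1_loss(r_target, r_neg):
    # Ranking error plus a regularization term pushing negative scores
    # toward zero.
    return np.mean(sigmoid(r_neg - r_target) + sigmoid(r_neg ** 2))

r_i = 2.0
r_j = np.array([-1.0, 0.5, 1.0])
b = bpr_loss(r_i, r_j)
t = top1_loss(r_i, r_j)
```

Both losses shrink as the target score r_i rises above the sampled negatives; TOP1 additionally discourages large negative-sample scores.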

Page 75: Deep Learning for Recommender Systems RecSys2017 Tutorial

GRU4Rec (3/3)
• Observations
– Similar accuracy with/without embedding
– Multiple layers rarely help
• Sometimes slight improvement with 2 layers
• Sessions span a short time, no need for multiple time scales
– Quick convergence: only small changes after 5-10 epochs
– Upper bound for model capacity
• No improvement when adding additional units after a certain threshold
• This threshold can be lowered with some techniques

• Results
– 20-30% improvement over item-to-item recommendations

Page 76: Deep Learning for Recommender Systems RecSys2017 Tutorial

Improving GRU4Rec

• Recall@20 on RSC15 by GRU4Rec: 0.6069 (100 units), 0.6322 (1000 units)

• Data augmentation [Tan et al., 2016]
– Generate additional sessions by taking every possible sequence starting from the end of a session
– Randomly remove items from these sequences
– Long training times
– Recall@20 on RSC15 (using the full training set for training): ~0.685 (100 units)

• Bayesian version (ReLeVar) [Chatzis et al., 2017]
– Bayesian formulation of the model
– Basically additional regularization by adding random noise during sampling
– Recall@20 on RSC15: 0.6507 (1500 units)

• New losses and additional sampling [Hidasi & Karatzoglou, 2017]
– Use additional samples beside mini-batch samples
– Design better loss functions:

$BPR\text{-}max = -\log \sum_{j=1}^{N_S} s_j \sigma\left(r_i - r_j\right) + \lambda \sum_{j=1}^{N_S} r_j^2$

– Recall@20 on RSC15: 0.7119 (100 units)
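The BPR-max loss weights each pairwise comparison by a softmax $s_j$ over the negative scores, so the hardest negatives dominate. A NumPy sketch, with the regularization term written as on the slide ($\lambda \sum_j r_j^2$); the function name and the small epsilon inside the log are my additions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bpr_max_loss(r_target, r_neg, lam=1.0):
    """BPR-max: softmax-weighted BPR over negatives + score regularization."""
    s = softmax(r_neg)                                   # weights s_j over negatives
    ranking = -np.log(np.sum(s * sigmoid(r_target - r_neg)) + 1e-24)
    reg = lam * np.sum(r_neg ** 2)                       # regularization over scores
    return ranking + reg
```

The softmax weighting focuses the gradient on negatives scored close to (or above) the target, which is what makes the extra sampled negatives pay off.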

Page 77: Deep Learning for Recommender Systems RecSys2017 Tutorial

Extensions

• Multi-modal information (p-RNN model) [Hidasi et al., 2016]
– Use image and description besides the item ID
– One RNN per information source
– Hidden states concatenated
– Alternating training

• Item metadata [Twardowski, 2016]
– Embed item metadata
– Merge with the hidden layer of the RNN (session representation)
– Predict compatibility using feedforward layers

• Contextualization [Smirnova & Vasile, 2017]
– Merging both current and next context
– Current context on the input module
– Next context on the output module
– The RNN cell is redefined to learn context-aware transitions

• Personalizing by inter-session modeling
– Hierarchical RNNs [Quadrana et al., 2017], [Ruocco et al., 2017]
• One RNN works within the session (next-click prediction)
• The other RNN predicts the transition between the sessions of the user
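The two-level idea behind hierarchical RNNs can be sketched with simplified tanh recurrences: a user-level RNN steps once per finished session and initializes the next session-level RNN. Everything here (weight names, the init projection, plain tanh cells instead of GRUs) is an illustrative assumption, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
H = 4                                                  # hidden size for both levels
Wi = rng.normal(0, 0.3, (H, H))                        # user state -> session init
Wu, Uu = rng.normal(0, 0.3, (H, H)), rng.normal(0, 0.3, (H, H))  # user-level RNN
Ws, Us = rng.normal(0, 0.3, (H, H)), rng.normal(0, 0.3, (H, H))  # session-level RNN
emb = rng.normal(0, 0.3, (20, H))                      # item embeddings (20 items)

def run_user(sessions):
    """Process one user's sessions; the user-level state carries across sessions."""
    c = np.zeros(H)                                    # inter-session (user) state
    for session in sessions:
        h = np.tanh(c @ Wi)                            # session RNN init from user state
        for item in session:
            h = np.tanh(emb[item] @ Ws + h @ Us)       # within-session next-click model
        c = np.tanh(h @ Wu + c @ Uu)                   # user RNN steps per session
    return c

state = run_user([[1, 2, 3], [4, 5]])
```

The session RNN does next-click prediction as in GRU4Rec; the user RNN only models transitions between that user's sessions.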

Page 78: Deep Learning for Recommender Systems RecSys2017 Tutorial

References

• [Chatzis et al., 2017] S. P. Chatzis, P. Christodoulou, A. Andreou: Recurrent Latent Variable Networks for Session-Based Recommendation. 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017). https://arxiv.org/abs/1706.04026

• [Cho et al., 2014] K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio: On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. https://arxiv.org/abs/1409.1259

• [Hidasi et al., 2015] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk: Session-based Recommendations with Recurrent Neural Networks. International Conference on Learning Representations (ICLR 2016). https://arxiv.org/abs/1511.06939

• [Hidasi et al., 2016] B. Hidasi, M. Quadrana, A. Karatzoglou, D. Tikk: Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations. 10th ACM Conference on Recommender Systems (RecSys'16).

• [Hidasi & Karatzoglou, 2017] B. Hidasi, A. Karatzoglou: Recurrent Neural Networks with Top-k Gains for Session-based Recommendations. https://arxiv.org/abs/1706.03847

• [Hochreiter & Schmidhuber, 1997] S. Hochreiter, J. Schmidhuber: Long Short-term Memory. Neural Computation, 9(8):1735-1780.

• [Quadrana et al., 2017] M. Quadrana, A. Karatzoglou, B. Hidasi, P. Cremonesi: Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks. 11th ACM Conference on Recommender Systems (RecSys'17). https://arxiv.org/abs/1706.04148

• [Ruocco et al., 2017] M. Ruocco, O. S. Lillestøl Skrede, H. Langseth: Inter-Session Modeling for Session-Based Recommendation. 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017). https://arxiv.org/abs/1706.07506

• [Smirnova & Vasile, 2017] E. Smirnova, F. Vasile: Contextual Sequence Modeling for Recommendation with Recurrent Neural Networks. 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017). https://arxiv.org/abs/1706.07684

• [Tan et al., 2016] Y. K. Tan, X. Xu, Y. Liu: Improved Recurrent Neural Networks for Session-based Recommendations. 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016). https://arxiv.org/abs/1606.08117

• [Twardowski, 2016] B. Twardowski: Modelling Contextual Information in Session-Aware Recommender Systems with Neural Networks. 10th ACM Conference on Recommender Systems (RecSys'16).

Page 79: Deep Learning for Recommender Systems RecSys2017 Tutorial

Conclusions

• Deep Learning is now in RecSys
• Huge potential, but a lot to do
– E.g. explore more advanced DL techniques

• Current research directions
– Item embeddings
– Deep collaborative filtering
– Feature extraction from content
– Session-based recommendations with RNNs

• Scalability should be kept in mind
• Don't fall for the hype, BUT don't disregard the achievements of DL and its potential for RecSys

Page 80: Deep Learning for Recommender Systems RecSys2017 Tutorial

Thank you!