
Joint Training Capsule Network for Cold Start Recommendation

Tingting Liang, Hangzhou Dianzi University, Hangzhou, China, [email protected]
Congying Xia, University of Illinois at Chicago, Chicago, US, [email protected]
Yuyu Yin, Hangzhou Dianzi University, Hangzhou, China, [email protected]
Philip S. Yu, University of Illinois at Chicago, Chicago, US, [email protected]

ABSTRACT
This paper proposes a novel neural network, the joint training capsule network (JTCN), for the cold start recommendation task. We propose to mimic the high-level user preference, rather than the raw interaction history, based on the side information of fresh users. Specifically, an attentive capsule layer is proposed to aggregate high-level user preference from the low-level interaction history via a dynamic routing-by-agreement mechanism. Moreover, JTCN jointly trains the loss for mimicking the user preference and the softmax loss for recommendation in an end-to-end manner. Experiments on two publicly available datasets demonstrate the effectiveness of the proposed model. JTCN improves over other state-of-the-art methods by at least 7.07% on CiteULike and 16.85% on Amazon in terms of Recall@100 for cold start recommendation.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Neural networks.

KEYWORDS
Recommender systems, Cold start, User preference estimation

ACM Reference Format:
Tingting Liang, Congying Xia, Yuyu Yin, and Philip S. Yu. 2020. Joint Training Capsule Network for Cold Start Recommendation. In 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), July 25–30, 2020, Xi'an, China. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/XXXXXX.XXXXXX

1 INTRODUCTION
The effectiveness of current recommender systems highly relies on the interactions between users and items. These systems usually do not perform well when new users or new items arrive. This challenge is widely known as cold start recommendation [7]. To alleviate this problem, models have been proposed to leverage side information such as user attributes [1] or user social network data [7, 12] to generate recommendations for new users. These models can be grouped into three categories based on how they use the side information: similarity-based models [11], which calculate similarities between items based on the side information; matrix


factorization methods with regularization [3], which regularize the latent features based on auxiliary relations; and matrix factorization methods with feature mapping [2], which learn a mapping between the side information and latent features. According to [6], these cold start models can be viewed within a simple unified linear framework which learns a mapping between the side information and the interaction history.

Recently, deep learning models (DNNs) have emerged to tackle the cold start problem by providing a larger model capacity. Volkovs et al. [13] regard cold start recommendation as a data missing problem and modify the learning procedure by applying dropout to input mini-batches. This approach depends heavily on the generalization ability of the dropout technique to carry the model from warm start to cold start. [6] is the first work that proposes to solve the cold start problem from the zero-shot learning (ZSL) perspective. It leverages a low-rank auto-encoder to reconstruct the interaction history from the user attributes. However, it is a two-step method which first learns the reconstruction and then solves the recommendation for the cold start users or items in the second step. A two-step method may suffer from error propagation.

To avoid the aforementioned problems and fully exploit the content in the side information, we propose an end-to-end joint training capsule network (JTCN) for cold start recommendation. In JTCN, a user is represented explicitly in two aspects: the high-level user preference and the content contained in the side information. The high-level user preference is aggregated from the low-level interaction history through an attentive capsule layer with a dynamic routing-by-agreement mechanism [10, 15].

A mimic loss is proposed to mimic the high-level user preference for cold start users or items from the side information. We argue that it is more explainable to mimic the high-level user preference than the low-level interaction history: it is natural to infer user preference from the side information rather than from a non-existent interaction history. Another softmax loss is used to train the regular recommendation process. Our goal is not only to mimic the high-level user preference for the cold start users or items, but also to make effective recommendations for them. We achieve both goals by jointly training these two losses in an end-to-end manner. In summary, the contributions of this paper are:
• Joint Training: A joint training framework is proposed for cold start recommendation by training the mimic loss for the cold start and the softmax loss for the recommendation together.

• Capsule Network: An attentive capsule layer is proposed to aggregate high-level user preference from the low-level interaction history via a dynamic routing-by-agreement mechanism.

• Demonstrated Effectiveness: Experiments on two real-world datasets show that our proposed model outperforms baselines consistently for the cold start recommendation task.



Figure 1: The architecture of JTCN. ID features of items are transformed into embeddings through the embedding layer. The embeddings of historical items are fed into the attentive multi-preference extraction layer, which consists of one multi-preference extraction layer and one attention layer, to obtain the user preference representation. The content representation produced by network $f_c$ is fused with the preference embedding to form the softmax loss. The output of network $f_p$ is used to approximate the user preference through a mimic loss. When making predictions for a new user, $f_c$ and $f_p$ are able to generate a comprehensive representation for the user based on the side information, including both preference and content information.

2 PROPOSED MODEL

2.1 Problem Statement
We consider a recommender system with a user set $\mathcal{U} = \{u_1, \ldots, u_N\}$ and an item set $\mathcal{V} = \{v_1, \ldots, v_M\}$, where $N$ is the number of users and $M$ is the number of items. The user-item feedback can be represented by a matrix $R \in \mathbb{R}^{N \times M}$, where $R_{ij} = 1$ if user $i$ gives positive feedback on item $j$, and $R_{ij} = 0$ otherwise. Let $\mathcal{U}(j) = \{i \in \mathcal{U} \mid R_{ij} \neq 0\}$ be the set of users that have shown preference for item $j$, and $\mathcal{V}(i) = \{j \in \mathcal{V} \mid R_{ij} \neq 0\}$ be the set of items on which user $i$ gave positive feedback. This paper focuses on the cold start scenario in which no preference clue is available, namely, $\mathcal{V}(i) = \emptyset$ or $\mathcal{U}(j) = \emptyset$ for a given user $i$ or item $j$. Our objective is to generate personalized recommendation results for each fresh user or item based on its corresponding side information.
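For concreteness, the following is a minimal NumPy sketch of the feedback matrix and the derived sets $\mathcal{V}(i)$ and $\mathcal{U}(j)$; the toy values are our illustration, not from the paper.

```python
import numpy as np

# Toy implicit-feedback matrix R (N = 3 users, M = 4 items):
# R[i, j] = 1 means user i gave positive feedback on item j.
R = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])  # user 2 has no history: a cold start user

# V(i): items user i interacted with; U(j): users who interacted with item j.
V = {i: set(map(int, np.flatnonzero(R[i]))) for i in range(R.shape[0])}
U = {j: set(map(int, np.flatnonzero(R[:, j]))) for j in range(R.shape[1])}

assert V[0] == {0, 2} and V[2] == set()   # V(2) is empty: cold start user
assert U[3] == set()                      # U(3) is empty: cold start item
```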

2.2 User Multi-Preference Extraction
The framework of our proposed model is illustrated in Figure 1. The framework here is mainly for cold start users; the framework for cold start items can be modeled in the same manner. As shown in Figure 1, the input of JTCN consists of user documents which contain the side information (labeled as User Doc in the figure), historical items, and the target item. The former two are used for extracting user properties and preferences, respectively. The target item is the one we use to make a prediction for the user during the training process. The usage of user documents is discussed in Section 2.3. This section focuses on the extraction of high-level user preferences.

2.2.1 Embedding Layer. The user preference extraction part starts with the item embedding layer, which embeds the ID features of items into low-dimensional dense vectors. For the target item, the embedding is denoted as $e_l \in \mathbb{R}^d$. For the historical items of user $i$ (i.e., $\mathcal{V}(i)$), the corresponding item embeddings are gathered to form the set of user preference embeddings $\mathcal{E}_i = \{e_j, j \in \mathcal{V}(i)\}$.
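A small sketch of this lookup; in the real model the embedding table is a learned parameter, and the random values and item ids below are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 1000, 64                    # number of items, embedding dimension
item_emb = rng.normal(scale=0.1, size=(M, d))   # learned jointly in JTCN

target_item = 42                   # the item l whose interaction we try to predict
hist_items = [3, 17, 256, 891]     # V(i): items user i gave positive feedback on

e_l = item_emb[target_item]        # target item embedding, shape (d,)
E_i = item_emb[hist_items]         # stacked preference embeddings, shape (|V(i)|, d)
```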

2.2.2 Attentive Capsule Layer for Multi-Preference Extraction. An important task of JTCN is to learn a network that mimics high-level preference representations for cold start users from their side information. Therefore, it is crucial to construct a representative user preference embedding during the training process. Representing user preference by a simple combination (e.g., averaging, concatenation) of the vectors $e_j \in \mathcal{E}_i$ is not conducive to extracting the diverse interests of users. Inspired by [5], we apply the recently proposed dynamic routing of capsule networks [10] to capture multiple preferences for users. Considering that not all preference capsules contribute equally to aggregating the high-level user preference representation, we further adopt an attention mechanism to discriminate the informative capsules.

We consider two layers of capsules, which we name behavior capsules and preference capsules, to represent the user behavior (historical items) and the multiple user preferences, respectively. Dynamic routing is adopted to compute the vectors of the preference capsules from the vectors of the behavior capsules in an iterative way. In each step, given the embedding $e_j \in \mathbb{R}^d$ of behavior capsule $j$ and the vector $p_k \in \mathbb{R}^d$ of preference capsule $k$, the routing logit is calculated by
$$ b_{jk} = p_k^{\top} S e_j, \quad (1) $$
where $S \in \mathbb{R}^{d \times d}$ denotes the bilinear mapping matrix shared across each pair of behavior and preference capsules.

The coupling coefficients between behavior capsule $j$ and all the preference capsules sum to 1 and are determined by performing a "routing softmax" on the logits:
$$ c_{jk} = \frac{\exp(b_{jk})}{\sum_{t} \exp(b_{jt})}. \quad (2) $$

With the coupling coefficients calculated, the candidate vector for preference capsule $k$ is computed as the weighted sum of all behavior capsules:
$$ z_k = \sum_{j} c_{jk} S e_j. \quad (3) $$

The embedding of preference capsule $k$ is obtained by a non-linear "squash" function:
$$ p_k = \mathrm{squash}(z_k) = \frac{\|z_k\|^2}{1 + \|z_k\|^2} \, \frac{z_k}{\|z_k\|}. \quad (4) $$
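The following NumPy sketch puts Eqs. (1)–(4) together. The paper does not state the number of routing iterations or how the logits are initialized, so the sketch follows the standard routing-by-agreement procedure [10] (zero-initialized logits, a few iterations, logits accumulated by the agreement term of Eq. (1)); treat those choices as assumptions.

```python
import numpy as np

def squash(z, eps=1e-9):
    """Eq. (4): rescale z to length ||z||^2 / (1 + ||z||^2), keeping its direction."""
    norm_sq = float(np.sum(z ** 2))
    return (norm_sq / (1.0 + norm_sq)) * z / np.sqrt(norm_sq + eps)

def dynamic_routing(E, S, K=5, iters=3):
    """Map behavior capsules E (n, d) to K preference capsules (K, d).

    S is the shared bilinear matrix of Eq. (1). The iteration count and the
    zero initialization of the logits are assumptions (standard practice [10]).
    """
    n, d = E.shape
    SE = E @ S.T                              # rows are S e_j, shape (n, d)
    B = np.zeros((n, K))                      # routing logits b_jk
    P = np.zeros((K, d))
    for _ in range(iters):
        C = np.exp(B - B.max(axis=1, keepdims=True))
        C /= C.sum(axis=1, keepdims=True)     # Eq. (2): softmax over preference capsules
        Z = C.T @ SE                          # Eq. (3): z_k = sum_j c_jk S e_j
        P = np.stack([squash(z) for z in Z])  # Eq. (4)
        B = B + SE @ P.T                      # Eq. (1): agreement term p_k^T S e_j
    return P

# Usage with random stand-ins for the learned quantities.
rng = np.random.default_rng(0)
d = 64
E_i = rng.normal(size=(7, d))                 # behavior capsules of one user
S = rng.normal(scale=0.1, size=(d, d))        # learned in the real model
P = dynamic_routing(E_i, S, K=5)              # preference capsules, shape (5, 64)
```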


Suppose we have $K$ preference capsules, which means $K$ distinct user preferences are extracted from the historical items on which the user gave positive feedback. We apply an attention layer to emphasize the informative capsules. There are several effective ways to calculate the attention score; this paper adopts a multi-layer perceptron (MLP):
$$ a_k = h^{\top} \mathrm{ReLU}(W_a p_k + b_a), \qquad \alpha_k = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})}, \quad (5) $$
where $W_a \in \mathbb{R}^{d_a \times d}$, $b_a \in \mathbb{R}^{d_a}$, and $h \in \mathbb{R}^{d_a}$ are the attention layer parameters. The final attentive weight is normalized by the softmax function.

With the attentive weights assigned to the preference capsules, the high-level user preference is formed as the weighted sum:
$$ u_p = \sum_{k=1}^{K} \alpha_k p_k. \quad (6) $$
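A sketch of Eqs. (5)–(6) in the same NumPy style; the random initial values are only for illustration, and in the real model $W_a$, $b_a$, and $h$ are learned.

```python
import numpy as np

def attentive_aggregation(P, W_a, b_a, h):
    """Eqs. (5)-(6): MLP attention over the K preference capsules, then a
    weighted sum producing the high-level user preference u_p."""
    A = np.maximum(P @ W_a.T + b_a, 0.0)      # ReLU(W_a p_k + b_a), shape (K, d_a)
    a = A @ h                                 # attention scores a_k, shape (K,)
    alpha = np.exp(a - a.max())
    alpha /= alpha.sum()                      # softmax-normalized weights alpha_k
    return alpha @ P                          # Eq. (6): u_p = sum_k alpha_k p_k

rng = np.random.default_rng(1)
d, d_a, K = 64, 128, 5
P = rng.normal(size=(K, d))                   # preference capsules from routing
W_a = rng.normal(scale=0.1, size=(d_a, d))
b_a = np.zeros(d_a)
h = rng.normal(scale=0.1, size=d_a)
u_p = attentive_aggregation(P, W_a, b_a, h)   # high-level preference, shape (d,)
```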

2.3 Joint Training
A joint training framework is proposed by optimizing two losses together: a softmax loss for recommendation and a mimic loss for generating user preferences for cold start users without interaction history. A user in JTCN is represented in two aspects: the user preferences and the content. Two MLP networks, $f_c$ and $f_p$, map the user document into a content space and a preference space for the cold start users, and the two representations are fused for the final prediction.

2.3.1 Softmax Loss. The output of $f_c$, denoted by $u_c$, is fused with the high-level user preference embedding defined by (6) to form the user embedding. The representation of user $i$ is generated by
$$ u_i = \mathrm{ReLU}(W_u [f_c(X_i), u_p^i] + b_u), \quad (7) $$
where $X_i$ denotes the input user document, and $W_u \in \mathbb{R}^{d \times 2d}$ and $b_u \in \mathbb{R}^d$ are parameters. $[\cdot, \cdot]$ denotes the concatenation operation and $\mathrm{ReLU}(\cdot)$ is the rectified linear unit. With the user vector $u_i$ and the target item embedding $e_l$, the probability of the user interacting with the target item is predicted by
$$ \Pr(e_l \mid u_i) = \frac{\exp(u_i^{\top} e_l)}{\sum_{j \in \mathcal{V}(i)} \exp(u_i^{\top} e_j)}. \quad (8) $$

We use the softmax loss as the objective function to minimize for the recommendation training:
$$ \mathcal{L}_{softmax} = -\sum_{(i,l) \in \mathcal{D}} \log \Pr(e_l \mid u_i), \quad (9) $$
where $\mathcal{D}$ is the collection of training data containing user-item interactions.
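A per-example sketch of Eqs. (7)–(9). The content network $f_c$ is replaced by a toy stand-in here, and following Eq. (8) the normalization runs over the items in $\mathcal{V}(i)$, so the sketch assumes the target item $l$ is one of them.

```python
import numpy as np

def softmax_loss_example(x_i, u_p_i, e_l, E_hist, W_u, b_u, f_c):
    """-log Pr(e_l | u_i) for one training pair (i, l), Eqs. (7)-(9)."""
    u_i = np.maximum(W_u @ np.concatenate([f_c(x_i), u_p_i]) + b_u, 0.0)  # Eq. (7)
    logits = E_hist @ u_i                    # u_i^T e_j for every j in V(i)
    target = float(u_i @ e_l)                # u_i^T e_l (e_l assumed to be in V(i))
    m = logits.max()
    log_norm = m + np.log(np.sum(np.exp(logits - m)))   # log of Eq. (8) denominator
    return log_norm - target                 # one term of Eq. (9)

# Stand-ins: d = 64, a user doc of 8,000 tf-idf features, 4 historical items.
rng = np.random.default_rng(2)
d = 64
W_c = rng.normal(scale=0.01, size=(8000, d))
f_c = lambda x: np.tanh(x @ W_c)             # toy stand-in for the content MLP
x_i = rng.random(8000)                       # user document features
u_p_i = rng.normal(size=d)                   # from the attentive capsule layer
E_hist = rng.normal(size=(4, d))             # embeddings of the items in V(i)
e_l = E_hist[0]                              # target item taken from V(i)
W_u, b_u = rng.normal(scale=0.1, size=(d, 2 * d)), np.zeros(d)
loss = softmax_loss_example(x_i, u_p_i, e_l, E_hist, W_u, b_u, f_c)
```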

2.3.2 Mimic Loss. In order to learn preference information from the document of a new user, we propose to use the output of $f_p$ to approximate the high-level user preference representation defined by (6). We define the mimic loss as the mean squared difference
$$ \mathcal{L}_{mimic} = \frac{1}{|\mathcal{D}|} \sum_{(i,l) \in \mathcal{D}} \sum_{d} \left( u_p^i - f_p(X_i) \right)^2, \quad (10) $$
where the inner sum runs over the $d$ dimensions of the preference vector.

Jointly training the following combination of the softmax loss and the mimic loss enables the network to better imitate the high-level preference and capture content from the side information of new users, which greatly improves cold start recommendation performance:
$$ \mathcal{L}_{joint} = \mathcal{L}_{softmax} + \mathcal{L}_{mimic}. \quad (11) $$
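A batch-level sketch of Eqs. (10)–(11): U_p stacks the aggregated preferences $u_p^i$ and F_p stacks the outputs $f_p(X_i)$ for a batch; gradient computation is left to whatever autodiff framework trains the model. Note that Eq. (11) sums the two losses with no weighting coefficient.

```python
import numpy as np

def mimic_loss(U_p, F_p):
    """Eq. (10): squared difference summed over dimensions, averaged over the batch."""
    return float(np.mean(np.sum((U_p - F_p) ** 2, axis=1)))

def joint_loss(softmax_losses, U_p, F_p):
    """Eq. (11): L_joint = L_softmax + L_mimic, with no weighting coefficient."""
    return float(np.sum(softmax_losses)) + mimic_loss(U_p, F_p)

rng = np.random.default_rng(3)
batch, d = 32, 64
U_p = rng.normal(size=(batch, d))            # aggregated preferences from Eq. (6)
F_p = rng.normal(size=(batch, d))            # preference-network outputs f_p(X_i)
per_example = rng.random(batch)              # -log Pr(e_l | u_i) terms from Eq. (9)
total = joint_loss(per_example, U_p, F_p)
```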

2.4 Cold Start Prediction
Once training is completed, as shown on the right side of Figure 1, we fix the model and make a forward pass through $f_c$ and $f_p$ to get the representation for a new user based on its side information:
$$ u_{new} = \mathrm{ReLU}(W_u [u_c, u_p] + b_u), \quad (12) $$
where $u_c = f_c(X_{new})$ and $u_p = f_p(X_{new})$. Finally, the preference score of the new user on item $j$ is given by the inner product of the corresponding embeddings:
$$ r_{new,j} = u_{new}^{\top} e_j. \quad (13) $$
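A sketch of the prediction pass of Eqs. (12)–(13): the trained networks $f_c$ and $f_p$ and the fusion layer ($W_u$, $b_u$) build the new user's representation from side information alone, and items are ranked by inner product. The helper names and stand-in values are ours.

```python
import numpy as np

def recommend_for_new_user(x_new, f_c, f_p, W_u, b_u, item_emb, top_n=100):
    """Return the indices of the top-N items for a cold start user."""
    u_c, u_p = f_c(x_new), f_p(x_new)
    u_new = np.maximum(W_u @ np.concatenate([u_c, u_p]) + b_u, 0.0)   # Eq. (12)
    scores = item_emb @ u_new                                         # Eq. (13): r_new,j
    return np.argsort(-scores)[:top_n]

# Example with toy stand-ins for the trained components.
rng = np.random.default_rng(4)
d, M, vocab = 64, 500, 8000
W_c = rng.normal(scale=0.01, size=(vocab, d))
W_p = rng.normal(scale=0.01, size=(vocab, d))
f_c = lambda x: np.tanh(x @ W_c)             # stand-in for the trained content MLP
f_p = lambda x: np.tanh(x @ W_p)             # stand-in for the trained preference MLP
W_u, b_u = rng.normal(scale=0.1, size=(d, 2 * d)), np.zeros(d)
item_emb = rng.normal(size=(M, d))           # trained item embeddings
x_new = rng.random(vocab)                    # tf-idf features of the new user's document
top_items = recommend_for_new_user(x_new, f_c, f_p, W_u, b_u, item_emb)
```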

3 EXPERIMENTS

3.1 Datasets
We choose two public datasets for evaluating cold start recommendation performance. 1) CiteULike¹, with 5,551 users, 16,980 articles, and 204,986 implicit user-article interactions. CiteULike contains article content information in the form of titles and abstracts. We use a vocabulary of the top 8,000 words selected by tf-idf [14]. 2) Amazon Movies and TV² [8]. We convert the explicit feedback with rating 5 to implicit feedback. We filter out users and items with fewer than 10 interactions and finally obtain 14,850 users, 23,232 items, and 548,296 interactions. The vocabulary size of words selected by tf-idf from item titles and descriptions is 10,000.

Since only item side information is available, we recommend users for the cold start items. For both datasets, we randomly select 20% of the items as cold start items, for which users will be recommended at test time. We use Recall and NDCG as evaluation metrics.

3.2 Baselines
We compare the proposed JTCN with several representative recommendation models, including three content-based methods, KNN [11], FM [9], and VBPR [3], and two deep learning methods, LLAE [6] and DropoutNet [13]. KNN uses content information to compute the cosine similarity between items. DropoutNet uses WMF [4] as the pre-trained model for its input preference. Except for KNN and LLAE, which do not have a latent-factor parameter, we set the number of latent factors to $d = 256$ for all methods. The other hyperparameters of all compared methods are tuned to find an optimal result. For JTCN, the number of preference capsules is set to $K = 5$, the dimension of the attention layer is set to $d_a = 128$, and the Adam optimizer with a learning rate of 0.0005 is adopted. For all methods except KNN, we use an early stopping strategy with a patience of 10.

3.3 Results
3.3.1 Model Comparison. The experimental results of JTCN and the baselines on the two datasets are reported in Table 1 in terms of Recall@100 and NDCG@100 (with $d = 256$). The second best results are marked with a star (*); JTCN achieves the best result on every metric. Clearly, JTCN remarkably outperforms the baseline models on both datasets. KNN shows poor performance in the cold start scenario, which is caused by the rough estimation of content-based similarity without any historical interaction. The improvement obtained by FM over KNN indicates the advantage of feature interaction. VBPR performs slightly better as it is proposed to alleviate the cold

¹ http://www.citeulike.org
² http://jmcauley.ucsd.edu/data/amazon/


Table 1: Performance comparison on the two datasets in terms of Recall@100 and NDCG@100.

Methods      CiteULike            Amazon
             Recall    NDCG       Recall    NDCG
KNN          0.2981    0.3453     0.0564    0.2358
FM           0.5100    0.4583     0.0924    0.2260
VBPR         0.5426    0.4825     0.0891    0.2215
LLAE         0.5816    0.5286*    0.1264*   0.2439
DropoutNet   0.6011*   0.5226     0.1013    0.2815*
JTCN         0.6436    0.5432     0.1477    0.3364
Improve      7.07%     2.76%      16.85%    19.50%

start problem by using both latent factors and content factors extracted from auxiliary information [3]. It can be easily observed that the deep learning based methods, LLAE and DropoutNet, which are dedicated to the cold start problem, perform better than the traditional content-based baselines. However, all the baselines only reconstruct or extract content factors from the input side information in the test stage. JTCN outperforms all baselines, improving Recall@100 by 7.07% and 16.85% on the two datasets over the best baseline. This indicates that combining the content information with preference information generated from the raw input of new users or items can effectively improve cold start recommendation performance.

In addition, DropoutNet requires a pre-trained model to generate the preference input for its main DNN, which may limit its generalization across datasets. In contrast, the proposed JTCN needs no such pre-trained model and learns directly from the raw input features.

3.3.2 Impact of Latent Factors. To analyze the importance of the latent factors, we compare the performance of FM, VBPR, and DropoutNet with the proposed JTCN with respect to the number of latent factors. As Figure 2 shows, JTCN consistently outperforms the baselines. As the number of latent factors increases, the improvement over the best baseline generally grows. This may be because the combination of content and preference representations is more informative, which requires a relatively larger hidden dimension to incorporate.

Figure 2: Recall@100 and NDCG@100 w.r.t. the number of predictive factors on the two datasets, comparing FM, VBPR, DropoutNet, and JTCN: (a) CiteULike Recall@100, (b) CiteULike NDCG@100, (c) Amazon Recall@100, (d) Amazon NDCG@100.

4 CONCLUSION
In this paper, a novel neural network model, the joint training capsule network (JTCN), is introduced to harness the advantages of the capsule model for extracting high-level user preference in the cold start recommendation task. JTCN optimizes a mimic loss and a softmax loss together in an end-to-end manner: the mimic loss is used to mimic the preference for cold start users or items, and the softmax loss is trained for recommendation. An attentive capsule layer is proposed to aggregate high-level preference from the low-level interaction history via a dynamic routing-by-agreement mechanism. Experiments on two real-world datasets show that JTCN consistently outperforms the baselines.

ACKNOWLEDGMENTS
This work is supported in part by NSFC under grant 61872119, NSF under grants III-1526499, III-1763325, III-1909323, and CNS-1930941, and the Key Research & Development Plan of Zhejiang Province under grant 2019C03134.


REFERENCES
[1] Ignacio Fernández-Tobías, Matthias Braunhofer, Mehdi Elahi, Francesco Ricci, and Iván Cantador. 2016. Alleviating the new user problem in collaborative filtering by exploiting personality information. User Modeling and User-Adapted Interaction 26, 2-3 (2016), 221–255.
[2] Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, and Lars Schmidt-Thieme. 2010. Learning attribute-to-feature mappings for cold-start recommendations. In ICDM. 176–185.
[3] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian personalized ranking from implicit feedback. In AAAI. 144–150.
[4] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In ICDM. 263–272.
[5] Chao Li, Zhiyuan Liu, Mengmeng Wu, Yuchi Xu, Huan Zhao, Pipei Huang, Guoliang Kang, Qiwei Chen, Wei Li, and Dik Lun Lee. 2019. Multi-interest network with dynamic routing for recommendation at Tmall. In CIKM. 2615–2623.
[6] Jingjing Li, Mengmeng Jing, Ke Lu, Lei Zhu, Yang Yang, and Zi Huang. 2019. From zero-shot learning to cold-start recommendation. In AAAI, Vol. 33. 4189–4196.
[7] Jovian Lin, Kazunari Sugiyama, Min-Yen Kan, and Tat-Seng Chua. 2013. Addressing cold-start in app recommendation: Latent user models constructed from Twitter followers. In SIGIR. 283–292.
[8] Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In EMNLP. 188–197.
[9] Steffen Rendle. 2010. Factorization machines. In ICDM. 995–1000.
[10] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In NIPS. 3856–3866.
[11] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW. 285–295.
[12] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, Lexing Xie, and Darius Braziunas. 2017. Low-rank linear cold-start recommendation from social data. In AAAI. 1502–1508.
[13] Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. DropoutNet: Addressing cold start in recommender systems. In NIPS. 4957–4966.
[14] Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In KDD. 448–456.
[15] Congying Xia, Chenwei Zhang, Xiaohui Yan, Yi Chang, and Philip S. Yu. 2018. Zero-shot user intent detection via capsule neural networks. In EMNLP. 3090–3099.