
Compositional Coding for Collaborative Filtering∗

Chenghao Liu¹,², Tao Lu²,⁴, Xin Wang³, Zhiyong Cheng⁵,⁶, Jianling Sun²,⁴, Steven C.H. Hoi¹
¹Singapore Management University, ²Zhejiang University, ³Tsinghua University
⁴Alibaba-Zhejiang University Joint Institute of Frontier Technologies
⁵Shandong Computer Science Center (National Supercomputer Center in Jinan)
⁶Qilu University of Technology (Shandong Academy of Sciences)
{twinsken,3140102441,sunjl}@zju.edu.cn, [email protected], [email protected], [email protected]

ABSTRACT

Efficiency is crucial to online recommender systems, especially those that need to deal with tens of millions of users and items. Because representing users and items as binary vectors for Collaborative Filtering (CF) enables fast user-item affinity computation in the Hamming space, recent years have witnessed an emerging research effort in exploiting binary hashing techniques for CF methods. However, CF with binary codes naturally suffers from low accuracy due to the limited representation capability of each bit, which impedes it from modeling the complex structure of the data.

In this work, we attempt to improve efficiency without hurting model performance by combining the accuracy of real-valued vectors with the efficiency of binary codes to represent users/items. In particular, we propose the Compositional Coding for Collaborative Filtering (CCCF) framework, which not only achieves better recommendation efficiency than the state-of-the-art binarized CF approaches but also higher accuracy than the real-valued CF method. Specifically, CCCF represents each user/item with a set of binary vectors, which are associated with a sparse real-valued weight vector. Each value of the weight vector encodes the importance of the corresponding binary vector to the user/item. The continuous weight vector greatly enhances the representation capability of binary codes, and its sparsity guarantees the processing speed. Furthermore, an integer weight approximation scheme is proposed to further accelerate recommendation. Based on the CCCF framework, we design an efficient discrete optimization algorithm to learn its parameters. Extensive experiments on three real-world datasets show that our method outperforms the state-of-the-art binarized CF methods (and even the real-valued CF method) by a large margin in terms of both recommendation accuracy and efficiency. We publish our project at https://github.com/3140102441/CCCF.

CCS CONCEPTS

• Information systems → Recommender System; • Human-centered computing → Collaborative filtering;

∗Jianling Sun is the corresponding author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGIR'19, July 2019, Paris, France
© 2019 Copyright held by the owner/author(s).
ACM ISBN 123-4567-24-567/08/06...$15.00
https://doi.org/10.475/123_4

KEYWORDS

Recommendation, Collaborative Filtering, Discrete Hashing

1 INTRODUCTION

Real-world recommender systems often have to deal with large numbers of users and items, especially for online applications such as e-commerce or music streaming services [2, 3, 13, 14, 27]. For many modern recommender systems, a de facto solution is based on Collaborative Filtering (CF) techniques, as exemplified by Matrix Factorization (MF) algorithms [8]. The principle of MF is to represent users' preferences and items' characteristics as $r$-dimensional vectors, based on the $m \times n$ user-item interaction matrix of $m$ users and $n$ items. With the user and item vectors obtained in the offline training stage, during the online recommendation stage the preference of a user towards an item is computed by the dot product of their vectors. However, when dealing with large numbers of users and items (e.g., millions or even billions), a naive implementation of typical CF techniques (e.g., based on MF) leads to very high computation cost for generating the ranked item list for a target user [11]. Specifically, recommending the top-$k$ preferred items for a user from $n$ items costs $O(nr + n \log k)$ with real-valued vectors. As a result, this process becomes a critical efficiency bottleneck in practice, where recommender systems typically must respond in real time to large numbers of users simultaneously. Therefore, a fast and scalable yet accurate CF solution is crucial for building real-time recommender systems.

Recent years have witnessed extensive research efforts on improving the efficiency of CF methods for scalable recommender systems. One promising paradigm is to explore hashing techniques [18, 33, 35] to represent users/items with binary codes instead of the real-valued latent factors of traditional MF methods. In this way, the dot product of a user vector and an item vector in MF can be computed by fast bit operations in the Hamming space [35]. Furthermore, by exploiting special data structures for indexing all items, the computational complexity of generating the top-$k$ preferred items is sub-linear or even constant [26, 33], which significantly accelerates the recommendation process.

However, learning binary codes is generally NP-hard [5] due to the discrete constraints. Given this NP-hardness, a two-stage optimization procedure [18, 33, 35], which first solves a relaxed optimization problem by ignoring the discrete constraints and then binarizes the results by thresholding, has become a compromise solution. Nevertheless, this solution suffers from a large quantization loss [30] and thus fails to preserve the original data geometry (user-item relevance and user-user relationships) in the


continuous real-valued vector space. As accuracy is arguably the most important evaluation metric for recommender systems, researchers have put a lot of effort into reducing the quantization loss by direct discrete optimization [12, 16, 30]. Despite the advantages of this improved optimization method, compared to real-valued vectors, CF with binary codes naturally suffers from low accuracy due to the limited representation capability of each bit, which impedes it from modeling complex relationships between users and items.

Figure 1: A toy example to illustrate the limitation of the binary codes of Discrete Collaborative Filtering (DCF) [30]. $v_1$, $v_2$, $v_3$, and $v_4$ denote the real-valued vectors for item embeddings, and $d_1$, $d_2$, $d_3$, and $d_4$ denote the binary codes for item embeddings. According to the film titles and genres, $v_1$ is the most similar to $v_2$, followed by $v_3$, while all of them are dissimilar to $v_4$. However, the binary codes learned by DCF cannot preserve the intrinsic similarity due to the limited representation capability of binary codes.

Figure 1 gives an example to illustrate the limits of binary codes. From the "film title" and "genres" columns, we can see that Star Wars is the most similar to Star Wars II, followed by The Matrix Reloaded, and all of them are action movies, while Titanic, categorized as a drama movie, is remarkably dissimilar to them. Real-valued vectors can easily preserve the original data geometry in the continuous vector space (e.g., the intrinsic movie relationships), as $v_1$, $v_2$, $v_3$, and $v_4$ do. However, if we preserve the geometric relations of Star Wars, Star Wars II, and The Matrix Reloaded by representing them with the binary codes $d_1$, $d_2$, and $d_3$ in the Hamming space, the binary code $d_4$ of Titanic becomes close to $d_2$ and $d_3$, which unfortunately leads to a large error.

In this work, we attempt to improve the efficiency without hurting the model performance. We propose a new user/item representation named "Compositional Coding", which combines the accuracy of real-valued vectors with the efficiency of binary codes to represent users and items. To improve the representation capability of the binary codes, each user/item is represented by $G$ components of $r$-dimensional binary codes ($r$ is relatively small) together with a $G$-dimensional sparse weight vector. The weight vector is real-valued and indicates the importance of the corresponding component of the binary codes. Compared to binary codes of the same length (equal to $Gr$), the real-valued weight vector significantly enriches the representation capability, while its sparsity preserves the high efficiency. To demonstrate how it works, we derive the Compositional Coding for Collaborative Filtering (CCCF) framework. To tackle the intractable discrete optimization of CCCF, we develop an efficient alternating optimization method which iteratively solves mixed-integer programming subproblems. Besides, we develop an integer approximation strategy for the weight vectors, which further accelerates the recommendation speed. We conduct extensive experiments whose promising results show that the proposed CCCF method not only improves accuracy but also boosts retrieval efficiency over state-of-the-art binary coding methods.

2 PRELIMINARIES

In this section, we first review the two-stage hashing method for collaborative filtering. Then, we introduce the direct discrete optimization method used in Discrete Collaborative Filtering (DCF) [30]. Finally, we discuss the limitation of binary codes in representation capability.

2.1 Two-stage Hashing Method

Matrix Factorization (MF) is the most successful and widely used CF-based recommendation method. It represents users and items with real-valued vectors, so that an interaction between a user and an item can be efficiently estimated by their inner product. Formally, we are given a user-item interaction matrix $R \in \mathbb{R}^{m \times n}$ with $m$ users and $n$ items. Let $u_i \in \mathbb{R}^r$ and $v_j \in \mathbb{R}^r$ denote the latent vectors for user $i$ and item $j$, respectively. Then the predicted preference of user $i$ towards item $j$ is formulated as $\hat{r}_{ij} = u_i^\top v_j$. To learn all user latent vectors $U = [u_1, \ldots, u_m]^\top \in \mathbb{R}^{m \times r}$ and item latent vectors $V = [v_1, \ldots, v_n]^\top \in \mathbb{R}^{n \times r}$, MF minimizes the following regularized squared loss on the observed ratings:

$$\arg\min_{U,V} \sum_{(i,j) \in \mathcal{V}} (R_{ij} - u_i^\top v_j)^2 + \lambda \mathcal{R}(U, V), \quad (1)$$

where $\mathcal{V}$ denotes all observed user-item pairs and $\mathcal{R}(U, V)$ is the regularization term with respect to $U$ and $V$, controlled by $\lambda > 0$. To improve recommendation efficiency, after obtaining the optimized real-valued user/item latent vectors, the two-stage hashing method uses binary quantization (rounding off [33] or rotation [18, 35]) to convert the continuous latent representations into binary codes. Denote by $B = [b_1, \ldots, b_m]^\top \in \{\pm 1\}^{m \times r}$ and $D = [d_1, \ldots, d_n]^\top \in \{\pm 1\}^{n \times r}$ the $r$-length user and item binary codes, respectively; then the inner product between the binary codes of a user and an item can be formulated as $b_i^\top d_j = 2H(b_i, d_j) - r$, where $H(\cdot)$ denotes the Hamming similarity. Based on fast bit operations, Hamming distance computation is extremely efficient. However, this method usually incurs a large quantization error, since the binary bits are obtained by thresholding real values to integers, and thus it cannot preserve the original data geometry in the continuous vector space [30].
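For concreteness, the identity $b_i^\top d_j = 2H(b_i, d_j) - r$ can be sketched with packed bits and popcount-style operations. The following minimal Python sketch is our own illustration (function names are ours, not from the paper):

import numpy as np

def pack_bits(code):
    # code: array of +1/-1 entries; map +1 -> bit 1, -1 -> bit 0
    return np.packbits((code > 0).astype(np.uint8))

def binary_inner_product(b, d, r):
    # Hamming similarity H = number of matching bits = r - popcount(b XOR d)
    xor = np.bitwise_xor(pack_bits(b), pack_bits(d))
    hamming_distance = np.unpackbits(xor)[:r].sum()
    similarity = r - hamming_distance
    return 2 * similarity - r  # equals b.dot(d)

b = np.array([1, -1, 1, 1, -1, -1, 1, -1])
d = np.array([1, 1, -1, 1, -1, 1, 1, -1])
assert binary_inner_product(b, d, 8) == b.dot(d)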

2.2 Direct Discrete Optimization Method

To circumvent the above issues, a direct discrete optimization method has been proposed in Discrete Collaborative Filtering (DCF) [30] and its extensions [12, 16, 31]. More formally, it learns the binary codes by optimizing the following objective function:

$$\arg\min_{B,D} \sum_{(i,j) \in \mathcal{V}} (R_{ij} - b_i^\top d_j)^2 + \lambda \mathcal{R}(B, D) \quad \text{s.t.} \; B \in \{\pm 1\}^{m \times r}, \; D \in \{\pm 1\}^{n \times r}. \quad (2)$$


This objective function is similar to that of a conventional MF task, except for the discrete constraints on the user/item embeddings. By additionally imposing balanced and de-correlated constraints, it can derive compact yet informative binary codes for both users and items.

However, compared to real-valued vectors, CF with binary codes naturally suffers from low accuracy due to the limited representation capability of each bit. Specifically, the $r$-dimensional real-valued vector space offers infinitely many possibilities to model users and items, while the number of unique binary codes in the Hamming space is $2^r$. An alternative way to resolve this issue is to use longer codes. Unfortunately, this adversely hurts the generalization of the model, especially in sparse scenarios.

Figure 2: Illustration of how the compositional coding framework makes a prediction for user $i$ given item $j$.

3 COMPOSITIONAL CODING FOR COLLABORATIVE FILTERING

3.1 Intuition

In this work, we attempt to improve the efficiency of CF without hurting the prediction performance. We propose a new user/item representation named "Compositional Coding", which combines the accuracy of real-valued vectors with the efficiency of binary codes to represent users and items. Since the fixed distance between each pair of binary codes impedes them from modeling different magnitudes of relationships between users and items, we assign a real-valued weight to each component of binary codes to remarkably enrich the representation capability. Besides, the importance parameters can implicitly prune unimportant components by setting their values close to 0, which naturally reduces the computation cost.

3.2 Overview

In general, the proposed compositional coding framework assumes that each user or item is represented by $G$ components of $r$-dimensional binary codes ($r$ is relatively small) and one $G$-dimensional sparse weight vector. Formally, denote by

$$b_i^{(1)}, \ldots, b_i^{(G)} \in \{\pm 1\}^r, \quad \eta_i = \big(\eta_i^{(1)}, \cdots, \eta_i^{(G)}\big) \in \mathbb{R}^G$$

the compositional codes for the $i$-th user, and

$$d_j^{(1)}, \ldots, d_j^{(G)} \in \{\pm 1\}^r, \quad \xi_j = \big(\xi_j^{(1)}, \cdots, \xi_j^{(G)}\big) \in \mathbb{R}^G$$

the compositional codes for the $j$-th item, respectively. Then the predicted preference of user $i$ for item $j$ is computed by taking the weighted sum of the inner products with respect to each of the $G$ components:

$$\hat{r}_{ij} = \sum_{k=1}^{G} w_{ij}^{(k)} (b_i^{(k)})^\top d_j^{(k)}, \quad (3)$$

where $w_{ij}^{(k)} = \eta_i^{(k)} \cdot \xi_j^{(k)}$ is the importance weight of the $k$-th component of binary codes with respect to user $i$ and item $j$.

Figure 2 illustrates how to estimate a user-item interaction using compositional codes. The inner product of each component of binary codes, $(b_i^{(k)})^\top d_j^{(k)}$, can be efficiently computed in the Hamming space using fast bit operations. The importance weight $w_{ij}^{(k)}$ of the $k$-th component is obtained by multiplying $\eta_i^{(k)}$ and $\xi_j^{(k)}$, which ensures that the importance weight is assigned a high value if and only if both the user and item weights are large, and becomes zero if either of them is zero. It is worth noting that we use multiplication instead of addition over the user and item weights to achieve high sparsity of the weight vector, which leads to a significant reduction of computation cost during online recommendation.
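To make Eq. (3) concrete, here is a minimal Python sketch of the prediction rule (our own illustration, not the authors' released code):

def cccf_predict(b_user, d_item, eta, xi):
    # b_user, d_item: lists of G binary (+1/-1) code vectors (numpy arrays)
    # eta, xi: G-dimensional sparse component weights of the user and the item
    score = 0.0
    for k in range(len(eta)):
        w = eta[k] * xi[k]              # w_ij^(k) = eta_i^(k) * xi_j^(k)
        if w == 0.0:                    # pruned component: no Hamming work needed
            continue
        # in production this inner product is the popcount trick of Section 2.1
        score += w * float(b_user[k] @ d_item[k])
    return score

The early `continue` on zero weights is where the sparsity of $\eta_i$ and $\xi_j$ turns into saved computation.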

Compared to representing users/items with binary codes, the key advantage of the proposed framework is that the sparse real-valued weight vector substantially increases the representation capacity of user/item embeddings through the compositional coding scheme. Recalling the example shown in Figure 1, we can preserve the movie relations by assigning the movies different weight vectors, so as to predict user-item interactions with a lower error. Another benefit mainly comes from the idea of compositional matrix approximation [10, 32], which is more suitable for real-world recommender systems, since the interaction matrix is very large and composed of diverse interaction behaviours. Under this view, the proposed compositional coding framework is characterized by multiple components of binary codes. Each component can be employed to discover the localized relationships among certain types of similar users and items, and thus can be more accurate in that particular region (e.g., young people viewing action movies while old people view romance movies). Therefore, the proposed compositional coding framework can encode users and items in a more informative way.

In addition to the improved representation and better accuracy, the compositional coding framework does not lose the high-efficiency advantage of binary codes, and can even gain more efficiency when sufficient sparsity is imposed. To show this, we analyze the time cost of the proposed framework and compare it against the conventional binary coding framework. In particular, assume the time cost of calculating the Hamming distance of $r$-bit codes is $T_h$ and the time cost of the weighted summation of all inner-product results is $T_s$; then the total cost for a user-item similarity computation is

$$nnz(w) \times T_h + T_s,$$

where $nnz(w)$ denotes the average number of non-zero values in the component weight vector $w \in \mathbb{R}^G$, which is much smaller than $G$ in our approach due to the sparsity of the user and item weights.


Similarly, the cost of conventional hashing-based CF models using $(G \times r)$-bit binary codes is $G \times T_h$, which can be higher than ours.
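As a hypothetical illustration with made-up numbers: with $G = 8$ components and an average of $nnz(w) = 3$ non-zero component weights, CCCF pays roughly $3T_h + T_s$ per user-item pair, against $8T_h$ for a conventional $(G \times r)$-bit code, so CCCF is cheaper whenever the weighted summation costs less than five Hamming computations.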

Remark. The proposed compositional coding framework is rather general. When identical weight vectors are used, it reduces to the binary codes of conventional hashing-based CF methods. Compositional codes can also be regarded as the real-valued latent vectors of a CF model if we adopt identical binary codes for each component. In summary, the compositional code is a flexible representation which benefits from both the strong representation capability of real-valued vectors and the fast similarity search of binary codes.

3.3 Formulation

In this section, we instantiate the compositional coding framework on Matrix Factorization (MF) models and term the proposed method Compositional Coding for Collaborative Filtering (CCCF). Note that the proposed compositional coding framework is potentially applicable to other types of CF models beyond MF.

Following the intuition of compositional coding, we assume that there exists a distance function $d$ that measures distances in the space of users ($i = 1, \ldots, m$) or items ($j = 1, \ldots, n$). The distance function leads to the notion of neighborhoods of user-user and item-item pairs. Its assumption states that the rating of a user-item pair $(i, j)$ can be approximated particularly well within its neighborhood. Thus, we identify $G$ components surrounding $G$ anchor points $(i'_1, j'_1), \ldots, (i'_G, j'_G)$. Following the setting of compositional matrix approximation [9, 10], the user weight $\eta_i^{(k)}$ can be instantiated with an Epanechnikov kernel¹ [25], which is formulated as:

$$\eta_i^{(k)} = \frac{3}{4}\left(1 - \frac{d(i, i'_k)^2}{h^2}\right) \mathbb{1}\big[d(i, i'_k) < h\big], \quad (4)$$

where $h > 0$ is a bandwidth parameter, $\mathbb{1}[\cdot]$ is the indicator function, and $d(\cdot)$ is a distance function measuring the similarity between $i$ and $i'_k$ (we will discuss it in Section 3.6). A large value of $h$ implies that $\eta_i$ has a wide spread, which means most of the user component weights are non-zero. In contrast, a small $h$ corresponds to a narrow spread of $\eta_i$, and most of the user component weights will be zero. The item weight $\xi_j^{(k)}$ follows an analogous formulation. In this way, the weight $w_{ij}^{(k)}$ for a user-item pair, defined as the product of the user and item weights, is endowed with sparsity. The method for selecting anchor points for each component is important, as it may affect the values of the component weights and further affect the generalization performance. A natural way is to uniformly sample them from the training set. In this work, we run k-means clustering on the user/item latent vectors and use the $G$ centroids as anchor points.
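Under our reading of Eq. (4) (we reconstruct the kernel with the usual $d/h$ normalization, which the extraction may have dropped), the component weight computation is tiny. A minimal Python sketch:

def epanechnikov_weight(dist, h):
    # eta_i^(k) per our reading of Eq. (4): zero outside the bandwidth h
    if dist >= h:
        return 0.0
    return 0.75 * (1.0 - (dist / h) ** 2)

Smaller bandwidths zero out more components, which is exactly the sparsity that Section 3.2 exploits at retrieval time.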

To learn the compositional codes, we adopt the squared loss to measure the reconstruction error, as in the standard MF method. For each set of binary codes, in order to maximize the entropy of each bit and make the bits as independent as possible, we impose balanced-partition and decorrelation constraints. In summary, learning the compositional codes for users and items can be formulated as the following optimization task:

$$\min \sum_{(i,j) \in \mathcal{V}} \Big(R_{ij} - \sum_{k=1}^{G} w_{ij}^{(k)} (b_i^{(k)})^\top d_j^{(k)}\Big)^2$$
$$\text{s.t.} \; \underbrace{1_m^\top B^{(k)} = 0, \; 1_n^\top D^{(k)} = 0}_{\text{balanced partition}}, \; \underbrace{(B^{(k)})^\top B^{(k)} = m I_r, \; (D^{(k)})^\top D^{(k)} = n I_r}_{\text{decorrelation}},$$
$$B^{(k)} \in \{\pm 1\}^{m \times r}, \; D^{(k)} \in \{\pm 1\}^{n \times r}, \; k = 1, \ldots, G, \quad (5)$$

¹Similar to [9], we also tried the uniform and triangular kernels, but the performance was worse than with the Epanechnikov kernel, in agreement with the theory of kernel smoothing [25].

where we denote by $B^{(k)} = [b_1^{(k)}, \ldots, b_m^{(k)}]^\top \in \{\pm 1\}^{m \times r}$ and $D^{(k)} = [d_1^{(k)}, \ldots, d_n^{(k)}]^\top \in \{\pm 1\}^{n \times r}$ the user and item binary codes of the $k$-th set of binary codes, respectively. The problem formulated in (5) is a mixed-binary-integer program, which is challenging since it is generally NP-hard and involves a combinatorial search over $O(2^{Gr(m+n)})$. An alternative way is to introduce auxiliary continuous variables $X^{(k)} \in \mathcal{B}^{(k)}$ and $Y^{(k)} \in \mathcal{D}^{(k)}$ for each set of binary codes, where $\mathcal{B}^{(k)} = \{X^{(k)} \in \mathbb{R}^{m \times r} \mid 1_m^\top X^{(k)} = 0, (X^{(k)})^\top X^{(k)} = m I_r\}$ and $\mathcal{D}^{(k)} = \{Y^{(k)} \in \mathbb{R}^{n \times r} \mid 1_n^\top Y^{(k)} = 0, (Y^{(k)})^\top Y^{(k)} = n I_r\}$. Then the balanced and decorrelation constraints can be softened by $\min_{X^{(k)} \in \mathcal{B}^{(k)}} \|B^{(k)} - X^{(k)}\|_F$ and $\min_{Y^{(k)} \in \mathcal{D}^{(k)}} \|D^{(k)} - Y^{(k)}\|_F$, respectively. Finally, we can solve problem (5) with respect to the $k$-th set of codes in a computationally tractable manner:

$$\min_{B^{(k)}, D^{(k)}, X^{(k)}, Y^{(k)}} \sum_{(i,j) \in \mathcal{V}} \Big(R_{ij} - \sum_{k=1}^{G} w_{ij}^{(k)} (b_i^{(k)})^\top d_j^{(k)}\Big)^2 + \alpha_1 \|B^{(k)} - X^{(k)}\|_F + \alpha_2 \|D^{(k)} - Y^{(k)}\|_F$$
$$\text{s.t.} \; B^{(k)} \in \{\pm 1\}^{m \times r}, \; D^{(k)} \in \{\pm 1\}^{n \times r}, \quad (6)$$

where $\alpha_1$ and $\alpha_2$ are tuning parameters. Since $\mathrm{tr}((B^{(k)})^\top B^{(k)}) = \mathrm{tr}((X^{(k)})^\top X^{(k)}) = mr$ and $\mathrm{tr}((D^{(k)})^\top D^{(k)}) = \mathrm{tr}((Y^{(k)})^\top Y^{(k)}) = nr$, the above optimization task can be turned into the following:

$$\min_{B^{(k)}, D^{(k)}, X^{(k)}, Y^{(k)}} \sum_{(i,j) \in \mathcal{V}} \Big(R_{ij} - \sum_{k=1}^{G} w_{ij}^{(k)} (b_i^{(k)})^\top d_j^{(k)}\Big)^2 - 2\alpha_1 \mathrm{tr}\big((B^{(k)})^\top X^{(k)}\big) - 2\alpha_2 \mathrm{tr}\big((D^{(k)})^\top Y^{(k)}\big)$$
$$\text{s.t.} \; 1_m^\top X^{(k)} = 0, \; 1_n^\top Y^{(k)} = 0, \; (X^{(k)})^\top X^{(k)} = m I_r, \; (Y^{(k)})^\top Y^{(k)} = n I_r,$$
$$B^{(k)} \in \{\pm 1\}^{m \times r}, \; D^{(k)} \in \{\pm 1\}^{n \times r}, \quad (7)$$

which is the proposed learning model for CCCF. Note that we do not discard the binary constraints but directly optimize the binary codes of each component. Through joint optimization of the binary codes and the auxiliary real-valued variables, we can achieve nearly balanced and uncorrelated binary codes. Next, we introduce an efficient solution for the mixed-integer optimization problem in Eq. (7).

3.4 Optimization

We employ an alternating optimization strategy to solve the above problem. Each iteration alternately updates $B^{(k)}$, $D^{(k)}$, $X^{(k)}$ and $Y^{(k)}$. The details are given below.

Learning $B^{(k)}$ and $D^{(k)}$: In this subproblem, we update $B^{(k)}$ with $D^{(k)}$, $X^{(k)}$ and $Y^{(k)}$ fixed. Since the objective function in Eq. (7) sums over the users of each component independently, we can update the binary codes for each user and item in parallel. Specifically, learning the binary codes for user $i$ with respect to component $k$ amounts to solving the following optimization problem:

$$\min_{b_i^{(k)} \in \{\pm 1\}^r} (b_i^{(k)})^\top \Big(\sum_{j \in \mathcal{V}_i} (w_{ij}^{(k)})^2 d_j^{(k)} (d_j^{(k)})^\top\Big) b_i^{(k)} - 2\Big(\sum_{j \in \mathcal{V}_i} \tilde{r}_{ij} (d_j^{(k)})^\top\Big) b_i^{(k)} - 2\alpha_1 (x_i^{(k)})^\top b_i^{(k)}, \quad (8)$$


where $\tilde{r}_{ij} = r_{ij} - \sum_{\tilde{k} \neq k} w_{ij}^{(\tilde{k})} (b_i^{(\tilde{k})})^\top d_j^{(\tilde{k})}$ is the residual of the observed rating excluding the inner product of component $k$.

Due to the discrete constraints, the optimization is generally NP-hard, so we adopt the bitwise learning method called Discrete Coordinate Descent (DCD) [21, 22] to update $b_i^{(k)}$. In particular, denoting by $b_{iq}^{(k)}$ the $q$-th bit of $b_i^{(k)}$ and by $b_{i\bar{q}}^{(k)}$ the remaining codes excluding $b_{iq}^{(k)}$, DCD updates $b_{iq}^{(k)}$ while fixing $b_{i\bar{q}}^{(k)}$. Thus, the update rule for the user binary code $b_{iq}^{(k)}$ can be formulated as

$$b_{iq}^{(k)} \leftarrow \mathrm{sgn}\big(O(\hat{b}_{iq}^{(k)}, b_{iq}^{(k)})\big), \quad (9)$$

where $\hat{b}_{iq}^{(k)} = \sum_{j \in \mathcal{V}_i} \big(\tilde{r}_{ij} - (w_{ij}^{(k)})^2 (d_{j\bar{q}}^{(k)})^\top b_{i\bar{q}}^{(k)}\big) d_{jq}^{(k)} + \alpha_1 x_{iq}^{(k)}$, $d_{j\bar{q}}^{(k)}$ is the remaining set of item codes excluding $d_{jq}^{(k)}$, and $O(x, y)$ is a function such that $O(x, y) = x$ if $x \neq 0$ and $O(x, y) = y$ otherwise. We iteratively update each bit until the procedure converges. Note that the computational complexity of updating $B^{(k)}$ is $O(\#iter \cdot mnr^2)$, which is a critical efficiency bottleneck when $m$ or $n$ is large. To compute $\hat{b}_{iq}^{(k)}$ efficiently, we rewrite it as

$$\hat{b}_{iq}^{(k)} = \sum_{j \in \mathcal{V}_i} \tilde{r}_{ij} d_{jq}^{(k)} - \sum_{j \in \mathcal{V}_i} (w_{ij}^{(k)})^2 (d_j^{(k)})^\top b_i^{(k)} d_{jq}^{(k)} + \sum_{j \in \mathcal{V}_i} (w_{ij}^{(k)})^2 b_{iq}^{(k)} + \alpha_1 x_{iq}^{(k)},$$

which reduces the computational cost to $O(\#iter \cdot (m + n) r^2)$.

Similarly, we can learn the binary code for item $j$ in component $k$ by solving

$$\min_{d_j^{(k)} \in \{\pm 1\}^r} (d_j^{(k)})^\top \Big(\sum_{i \in \mathcal{V}_j} (w_{ij}^{(k)})^2 b_i^{(k)} (b_i^{(k)})^\top\Big) d_j^{(k)} - 2\Big(\sum_{i \in \mathcal{V}_j} \tilde{r}_{ij} (b_i^{(k)})^\top\Big) d_j^{(k)} - 2\alpha_2 (y_j^{(k)})^\top d_j^{(k)}.$$

Denoting by $d_{jq}^{(k)}$ the $q$-th bit of $d_j^{(k)}$ and by $d_{j\bar{q}}^{(k)}$ the remaining codes excluding $d_{jq}^{(k)}$, we update each bit of $d_j^{(k)}$ according to

$$d_{jq}^{(k)} \leftarrow \mathrm{sgn}\big(O(\hat{d}_{jq}^{(k)}, d_{jq}^{(k)})\big), \quad (10)$$

where $\hat{d}_{jq}^{(k)} = \sum_{i \in \mathcal{V}_j} \big(\tilde{r}_{ij} - (w_{ij}^{(k)})^2 (b_{i\bar{q}}^{(k)})^\top d_{j\bar{q}}^{(k)}\big) b_{iq}^{(k)} + \alpha_2 y_{jq}^{(k)}$.

Learning $X^{(k)}$ and $Y^{(k)}$: When fixing $B^{(k)}$, learning $X^{(k)}$ reduces to optimizing the following objective:

$$\max_{X^{(k)}} \mathrm{tr}\big(B^{(k)} (X^{(k)})^\top\big), \quad \text{s.t.} \; 1_m^\top X^{(k)} = 0, \; (X^{(k)})^\top X^{(k)} = m I_r.$$

This can be solved with the aid of SVD, following [17]. Let $\bar{B}^{(k)}$ be the column-wise zero-mean matrix with $\bar{B}_{ij}^{(k)} = B_{ij}^{(k)} - \frac{1}{m}\sum_{i'} B_{i'j}^{(k)}$. Let $\bar{B}^{(k)} = P_b^{(k)} \Sigma_b^{(k)} (Q_b^{(k)})^\top$ be its SVD, where the columns of $P_b^{(k)} \in \mathbb{R}^{m \times r'}$ and $Q_b^{(k)} \in \mathbb{R}^{r \times r'}$ are the left and right singular vectors corresponding to the $r'$ non-zero singular values in the diagonal matrix $\Sigma_b^{(k)}$. Since $\bar{B}^{(k)}$ and $P_b^{(k)}$ span the same column space, $1^\top \bar{B}^{(k)} = 0$ implies $1^\top P_b^{(k)} = 0$. Then we construct matrices $\hat{P}_b^{(k)}$ of size $m \times (r - r')$ and $\hat{Q}_b^{(k)}$ of size $r \times (r - r')$ by a Gram-Schmidt process such that $(\hat{P}_b^{(k)})^\top \hat{P}_b^{(k)} = I_{r-r'}$, $[P_b^{(k)}, 1]^\top \hat{P}_b^{(k)} = 0$, and $(\hat{Q}_b^{(k)})^\top \hat{Q}_b^{(k)} = I_{r-r'}$, $(Q_b^{(k)})^\top \hat{Q}_b^{(k)} = 0$. Now we obtain a closed-form update rule for $X^{(k)}$:

$$X^{(k)} \leftarrow \sqrt{m}\, [P_b^{(k)}, \hat{P}_b^{(k)}] [Q_b^{(k)}, \hat{Q}_b^{(k)}]^\top. \quad (11)$$

In practice, to compute such an optimal $X^{(k)}$, we perform the eigendecomposition of the small $r \times r$ matrix

$$(\bar{B}^{(k)})^\top \bar{B}^{(k)} = [Q_b^{(k)}, \hat{Q}_b^{(k)}] \begin{bmatrix} (\Sigma_b^{(k)})^2 & 0 \\ 0 & 0 \end{bmatrix} [Q_b^{(k)}, \hat{Q}_b^{(k)}]^\top,$$

which provides $Q_b^{(k)}$, $\hat{Q}_b^{(k)}$ and $\Sigma_b^{(k)}$, and then obtain

$$P_b^{(k)} = \bar{B}^{(k)} Q_b^{(k)} (\Sigma_b^{(k)})^{-1}.$$

The matrix $\hat{P}_b^{(k)}$ can then be obtained by the aforementioned Gram-Schmidt orthogonalization. Note that it requires $O(r^2 m)$ to perform the SVD, Gram-Schmidt orthogonalization and matrix multiplication.

With $D^{(k)}$ fixed, learning $Y^{(k)}$ can be solved in a similar way:

$$\max_{Y^{(k)}} \mathrm{tr}\big(D^{(k)} (Y^{(k)})^\top\big), \quad \text{s.t.} \; 1_n^\top Y^{(k)} = 0, \; (Y^{(k)})^\top Y^{(k)} = n I_r.$$

We obtain the analytic solution

$$Y^{(k)} \leftarrow \sqrt{n}\, [P_d^{(k)}, \hat{P}_d^{(k)}] [Q_d^{(k)}, \hat{Q}_d^{(k)}]^\top, \quad (12)$$

where the columns of $P_d^{(k)}$ and $Q_d^{(k)}$ are the left and right singular vectors of $\bar{D}^{(k)}$, respectively, $\hat{Q}_d^{(k)}$ contains the eigenvectors corresponding to the zero eigenvalues of the $r \times r$ matrix $(\bar{D}^{(k)})^\top \bar{D}^{(k)}$, and $\hat{P}_d^{(k)}$ is obtained via the Gram-Schmidt process. We summarize the solution for CCCF in Algorithm 1.

Algorithm 1 The proposed algorithm for Compositional Coding for Collaborative Filtering (CCCF).

Input: $R \in \mathbb{R}^{m \times n}$
Output: $B^{(k)} \in \{\pm 1\}^{m \times r}$, $D^{(k)} \in \{\pm 1\}^{n \times r}$, $k = 1, \ldots, G$
Parameters: number of components $G$, code length $r$, regularization coefficients $\alpha_1, \alpha_2$, bandwidth parameter $h$

Initialize $B^{(k)}$, $D^{(k)}$ and $X^{(k)}$, $Y^{(k)}$ by Eq. (13).
while not converged do
  for $k = 1, \cdots, G$ in parallel do
    Pick anchor point $(i'_k, j'_k)$.
    for $i = 1, \cdots, m$ do: update $b_i^{(k)}$ according to Eq. (9).
    for $j = 1, \cdots, n$ do: update $d_j^{(k)}$ according to Eq. (10).
    Update $X^{(k)}$ and $Y^{(k)}$ according to Eqs. (11) and (12).
  end for
end while
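For reference, the closed-form update in Eq. (11) admits a compact implementation via the eigendecomposition route described above. The following is our own numpy sketch (function name and the zero-eigenvalue tolerance are ours), not the authors' code:

import numpy as np

def update_X(B):
    # Closed-form solve of max tr(B X^T) s.t. 1^T X = 0, X^T X = m*I (Eq. (11)).
    m, r = B.shape
    B_bar = B - B.mean(axis=0, keepdims=True)          # column-wise zero mean
    # eigendecomposition of the small r x r matrix (B_bar^T B_bar)
    eigvals, Q_full = np.linalg.eigh(B_bar.T @ B_bar)
    nz = eigvals > 1e-10
    Q, Q_hat = Q_full[:, nz], Q_full[:, ~nz]
    P = B_bar @ Q / np.sqrt(eigvals[nz])               # P = B_bar Q Sigma^{-1}
    if Q_hat.shape[1] > 0:
        # complete P with an orthonormal block orthogonal to [P, 1] (Gram-Schmidt)
        basis = np.hstack([P, np.ones((m, 1)) / np.sqrt(m)])
        rand = np.random.randn(m, Q_hat.shape[1])
        rand -= basis @ (basis.T @ rand)               # project out span([P, 1])
        P_hat, _ = np.linalg.qr(rand)
        P, Q = np.hstack([P, P_hat]), np.hstack([Q, Q_hat])
    return np.sqrt(m) * P @ Q.T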

3.5 Initialization

Since the proposed optimization problem is a mixed-integer non-convex problem, the initialization of the model parameters plays an important role for fast convergence and for finding a better local optimum. To achieve a good initialization in an efficient way, we relax the binary constraints in Eq. (7) into the following optimization:

$$\min_{U^{(k)}, V^{(k)}, X^{(k)}, Y^{(k)}} \sum_{(i,j) \in \mathcal{V}} \Big(R_{ij} - \sum_{k=1}^{G} w_{ij}^{(k)} (u_i^{(k)})^\top v_j^{(k)}\Big)^2 - 2\alpha_1 \mathrm{tr}\big((U^{(k)})^\top X^{(k)}\big) - 2\alpha_2 \mathrm{tr}\big((V^{(k)})^\top Y^{(k)}\big) + \alpha_3 \|U^{(k)}\|_F^2 + \alpha_4 \|V^{(k)}\|_F^2 \quad (13)$$
$$\text{s.t.} \; 1_m^\top X^{(k)} = 0, \; 1_n^\top Y^{(k)} = 0, \; (X^{(k)})^\top X^{(k)} = m I_r, \; (Y^{(k)})^\top Y^{(k)} = n I_r.$$

We first initialize the real-valued matrices $U^{(k)}$ and $V^{(k)}$ randomly and find feasible solutions for $X^{(k)}$ and $Y^{(k)}$ according to the learning method above for $X^{(k)}$ and $Y^{(k)}$. Then alternating optimization is conducted by updating $U$ and $V$ with traditional gradient descent and updating $X$ and $Y$ with the same learning method. Once we obtain the solution $(U_0^{(k)}, V_0^{(k)}, X_0^{(k)}, Y_0^{(k)})$, we initialize CCCF with respect to component $k$ as:

$$B^{(k)} \leftarrow \mathrm{sgn}(U_0^{(k)}), \quad D^{(k)} \leftarrow \mathrm{sgn}(V_0^{(k)}), \quad X^{(k)} \leftarrow X_0^{(k)}, \quad Y^{(k)} \leftarrow Y_0^{(k)}. \quad (14)$$

The effectiveness of the proposed initialization is discussed in Section 4.1 (illustrated in Figure 3).
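As a rough, stripped-down sketch of this initialization (our illustration: it drops the component weights and the trace terms of Eq. (13), keeping only relaxed MF by SGD followed by the sign binarization of Eq. (14)):

import numpy as np

def init_component(ratings, m, n, r, lr=0.01, reg=0.01, n_epochs=10):
    # ratings: iterable of (i, j, r_ij) observed triples
    U, V = 0.1 * np.random.randn(m, r), 0.1 * np.random.randn(n, r)
    for _ in range(n_epochs):
        for i, j, r_ij in ratings:
            err = r_ij - U[i] @ V[j]
            U[i], V[j] = (U[i] + lr * (err * V[j] - reg * U[i]),
                          V[j] + lr * (err * U[i] - reg * V[j]))
    # Eq. (14): binarize by sign (a tie at exactly 0 would need breaking in practice)
    return np.sign(U), np.sign(V)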

3.6 Distance Function

Previously we assumed a general distance function $d$, defined to measure the distance between users or between items so as to compute the component weights $\eta_i^{(k)}$ and $\xi_j^{(k)}$ in Eq. (4). The metric can be constructed from side information, such as users' social links [28, 34], or with metric learning techniques [29]. However, many datasets do not include such data. In this work, we follow the idea of [9, 10], which factorizes the observed interaction matrix using MF to obtain two latent representation matrices $U$ and $V$ for users and items, respectively. Then the distance between two users is computed as the cosine distance between the obtained latent representations, which is formulated as

$$d(u_i, u_j) = \arccos\left(\frac{\langle u_i, u_j \rangle}{\|u_i\| \cdot \|u_j\|}\right).$$

The distance between two items is computed in the same way.
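In code, this distance is essentially one line (our sketch; the clip guards against floating-point values just outside $[-1, 1]$):

import numpy as np

def arccos_distance(u_a, u_b):
    # angular distance between two latent vectors, as used by the kernel in Eq. (4)
    cos_sim = u_a @ u_b / (np.linalg.norm(u_a) * np.linalg.norm(u_b))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))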

3.7 Complexity

The computational complexity of training CCCF is $G$ times the complexity of learning each set of binary codes. The algorithm converges quickly in practice, usually taking about 4-5 iterations in our experiments. For each iteration, the computational cost of updating $B^{(k)}$ and $D^{(k)}$ is $O(\#iter \cdot (m + n) r^2)$, where $\#iter$ is usually 2-5 in practice. The computational cost of updating $X^{(k)}$ and $Y^{(k)}$ is $O(r^2 m)$ and $O(r^2 n)$, respectively. Suppose the entire algorithm requires $T$ iterations to converge; then the overall time complexity of Algorithm 1 is $O(T G r^2 (m + n))$, where we found empirically that $T$ is no more than 5. In summary, CCCF is efficient and scalable because it scales linearly with the number of users and items.

3.8 Fast Retrieval via Integer Weight Scaling

Floating-point operations over the user and item weight vectors invoke more CPU cycles and are usually much slower than integer computation. The cost of top-$k$ recommendation is remarkably lower when the scalars in the weight vectors are integers instead of floating-point numbers. An intuitive approach is integer approximation, i.e., rounding the scalars in the user and item weight vectors. However, if the weights are too small, this incurs a large deviation. To tackle this problem, we scale the original scalars by multiplying each weight by $e$ and then approximate them with integers in preprocessing:

$$\hat{\eta}_i = \big(\lfloor e \cdot \eta_i^{(1)} \rceil, \ldots, \lfloor e \cdot \eta_i^{(G)} \rceil\big), \quad \hat{\xi}_j = \big(\lfloor e \cdot \xi_j^{(1)} \rceil, \ldots, \lfloor e \cdot \xi_j^{(G)} \rceil\big),$$

where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer.
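A minimal sketch of this preprocessing step (our illustration; the scale $e = 100$ mirrors the value the experiments in Section 4.2.3 settle on):

import numpy as np

def scale_weights_to_int(weights, e=100):
    # scale each weight by e, then round to the nearest integer
    return np.rint(e * np.asarray(weights)).astype(np.int64)

eta = np.array([0.0, 0.61, 0.0, 0.08])
print(scale_weights_to_int(eta))   # -> [ 0 61  0  8]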

4 EXPERIMENTS

To validate the effectiveness and efficiency of the proposed CCCF method for recommender systems, we conduct an extensive set of experiments examining different aspects of our method in comparison to state-of-the-art methods based on conventional binary coding. We aim to answer the following questions:

RQ1: How does CCCF perform compared to other state-of-the-art hashing-based recommendation methods in terms of both accuracy and retrieval time?

RQ2: How do different hyper-parameter settings (e.g., number of components and code length) affect the accuracy of CCCF?

RQ3: How do the sparsity of the component weight vectors (controlled by the bandwidth parameter $h$) and integer scaling (controlled by the parameter $e$) affect both the accuracy and the retrieval cost of CCCF? How should their optimal values be chosen?

RQ4: Does the representation of compositional codes in CCCF enjoy a much stronger representation capability than the traditional binary codes of DCF given the same model size?

4.1 Experimental Settings

4.1.1 Datasets and Settings. We run our experiments on three public datasets widely used in the literature: Movielens 1M², Amazon, and Yelp³. All ratings range from 0 to 5. Considering the severe sparsity of the original Yelp and Amazon datasets, we follow the conventional filtering strategy [20] of removing users and items with fewer than 10 ratings. The statistics of the filtered datasets are shown in Table 1. For each user, we randomly sample 70% of the ratings as training data and use the remaining 30% for testing. We repeat this for 5 random splits and report the averaged results.

Dataset        #Ratings    #Items    #Users    Density
Movielens 1M   1,000,209   3,900     6,040     4.2%
Yelp           696,865     25,677    25,815    0.11%
Amazon         5,057,936   146,469   189,474   0.02%

Table 1: Summary of datasets in our experiments.

4.1.2 Parameter Settings and Performance Metrics. For CCCF, we vary the number of components and the code length of each component in the range {4, 8, 12, 16}. The hyper-parameters $\alpha_1$ and $\alpha_2$ are tuned within {10⁻⁴, 10⁻³, ..., 10²}. Grid search is performed to choose the best parameters on the training split. We evaluate our proposed algorithms by Normalized Discounted Cumulative Gain (NDCG) [6], probably the most popular ranking metric for capturing the importance of retrieving good items at the top of ranked lists. The average NDCG at cut-offs {2, 4, 6, 8, 10} over all users is the final metric. A higher NDCG@K reflects better recommendation performance.
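For reference, one common NDCG@K variant (linear gains; the paper does not pin down the exact gain function, so this is an assumption) can be sketched as:

import numpy as np

def ndcg_at_k(pred_relevances, k):
    # pred_relevances: graded relevance of candidate items, in predicted rank order
    rel = np.asarray(pred_relevances, dtype=float)
    gains = rel[:k] / np.log2(np.arange(2, min(k, rel.size) + 2))
    dcg = gains.sum()
    ideal = np.sort(rel)[::-1][:k]                       # best possible ordering
    idcg = (ideal / np.log2(np.arange(2, ideal.size + 2))).sum()
    return dcg / idcg if idcg > 0 else 0.0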

²http://grouplens.org/datasets/movielens
³http://www.yelp.com/dataset_challenge


4.1.3 Baseline Methods and Implementations. To validate the effectiveness of CCCF, we compare it with several state-of-the-art real-valued CF methods and hashing-based CF methods:

• MF: The classic Matrix Factorization based CF algorithm [8], which learns real-valued user and item latent vectors in Euclidean space.
• BCCF: A two-stage binarized CF method [35] with a relaxation stage and a quantization stage. In these two stages, it successively solves MF with balanced code regularization and applies orthogonal rotation to obtain user and item codes.
• DCF: The first method [30] to directly tackle a discrete optimization problem for seeking informative and compact binary codes for users and items.
• DCMF: The state-of-the-art binarized method [12] for CF with side information. It extends DCF by encoding the side features as constraints for the user and item codes.

The CCCF algorithm empirically converges very fast, and using the proposed initialization generally helps, as shown in Figure 3.

Figure 3: Convergence of the overall objective value and NDCG@10 of CCCF with/without initialization on the Movielens 1M dataset. The use of the proposed initialization leads to faster convergence and better results.

4.2 Experimental Results

4.2.1 Comparisons with State-of-the-arts (RQ1). Figure 4 shows the results of top-$k$ recommendation with $k$ ranging from 2 to 10. Note that the number of components $G$ is fixed to 8 and the code length of each component varies in {4, 6, 8, 10, 12, 14, 16}. For a fair comparison, the code length of DCF and the rank of MF are equal to the total number of bits of CCCF, i.e., the sum of the component code lengths of CCCF ($rG$), so that the performance gain does not come from increased model complexity. From Figure 4, we can draw the following major observations.

First, CCCF considerably outperforms BCCF, DCF and DCMF, the state-of-the-art hashing-based CF methods. The performance of CCCF and DCF continuously increases as the code length increases. Surprisingly, CCCF achieves remarkably superior performance using only 32 bits compared to DCF using 128 bits on all three datasets. This improvement indicates the impressive effectiveness of learning compositional codes.

Second, among the baseline methods, DCF consistently outperforms BCCF, while slightly underperforming DCMF. This is consistent with the findings in [30] that direct discrete optimization can surpass the two-stage methods. Besides, side information makes the user and item codes more representative, which improves recommendation performance.

Moreover, it is worth mentioning that CCCF outperforms MF, a real-valued CF method, particularly on the Amazon and Yelp datasets. The reasons are two-fold. First, the compositional structure of CCCF has a much stronger representation capability, which can discover complex relationships among users and items. Second, the higher sparsity of these datasets makes MF prone to overfitting, whereas the binarized and sparse parameters of CCCF alleviate this issue. This finding again demonstrates the effectiveness of our method.

Dataset        CCCF Time   MF Time    MF Speedup   DCF Time   DCF Speedup
Movielens 1M   7.12        49.66      ×6.97        10.64      ×1.49
Amazon         831.34      5917.56    ×7.12        1231.35    ×1.48
Yelp           184.48      1450.25    ×7.86        264.65     ×1.43

Table 2: Retrieval time (in seconds) of recommendation methods on three datasets. 'Speedup' indicates the speedup (×) of CCCF ($G = 8$, $r = 16$) over the baselines.

Finally, Table 2 shows the total time cost (in seconds) taken by each method to generate the top-$k$ item lists for all users. Overall, the hashing-based methods (DCF and CCCF) outperform the real-valued method (MF), indicating the great advantage of binarizing the real-valued parameters. Moreover, CCCF shows superior retrieval time compared to DCF while enjoying better accuracy. Thus, CCCF is a suitable model for large-scale recommender systems where the retrieval time is restricted to a limited time quota.

4.2.2 Impact of Hyper-parameters $G$ and $r$ (RQ2). CCCF has two key parameters controlling its complexity and capacity: the number of components $G$ and the code length $r$. Figures 5(a) and 5(b) evaluate how they affect recommendation performance under varied total bits and fixed total bits, respectively. In Figure 5(a), we vary the code length $r$ from 4 to 16 and the component number $G$ from 1 to 16. We can see that increasing $G$ leads to continued improvements. When $G$ is larger than 8, the improvement tends to saturate as the number of components increases. In addition, a larger value of $G$ leads to relatively longer training time. Similar observations hold for the evaluation of the hyper-parameter $r$. In Figure 5(b), we fix the total bits $rG$ in the range {32, 64, 96, 128} and vary the component number $G$ from 1 to 32. Note that when $G = 1$, the CCCF model is identical to the DCF model. As we gradually increase the component number $G$, the recommendation performance grows, since the real-valued component weights enhance the representation capability. The best recommendation performance is achieved at $G = 8$ or 16. When $G$ exceeds these optimal values, increasing $G$ hurts performance. The main reason is that with the total bits $rG$ fixed, larger values of $G$ lead to smaller values of $r$. This reduces the learning space of CCCF, since the component weights are calculated by a predefined distance function which does not consider the rating information.

4.2.3 Impact of the Sparsity of Component Weights and Integer Weight Scaling (RQ3). To demonstrate the effectiveness of integer weight scaling in terms of accuracy and retrieval time, we run two versions of CCCF: EXACT and IWS. EXACT does not adopt the proposed integer weight scaling strategy, while IWS uses the integer weight scaling described in Section 3.8. Figure 6 summarizes the


Figure 4: Item recommendation performance comparison in NDCG@K with respect to code length in bits.

Figure 5: Performance of CCCF with respect to code length in bits ($r$) and number of components ($G$): (a) varied total bits ($rG$); (b) fixed total bits ($rG$).

speedup of the two versions of CCCF. The retrieval cost is not sensitive to the integer scaling parameter $e$, and we set $e$ to 100. We can see that the gap between the two versions is consistent across the Amazon and Yelp datasets. The low cost of IWS validates the effectiveness of the integer scaling, which replaces floating-point operations with fast integer computation.

Figure 6: Retrieval cost of the naive (EXACT) version and the IWS version.

Figure 7 shows the impact of the hyper-parameter $e$ on the accuracy of CCCF. We find that when $e$ is smaller than 100, increasing $e$ leads to gradual improvements. When $e$ is larger than 100, further increasing its value does not improve performance. This indicates that IWS is relatively insensitive to $e$ once it is sufficiently large. We thus suggest setting $e$ to 100.

Figure 7: Performance of CCCF with different $e$ values.

To reveal the impact of the sparsity of the component weights on accuracy and retrieval cost, we vary the bandwidth hyper-parameter $h$ from 0.5 to 1. Obviously, decreasing the value of $h$ increases the sparsity of the component weights. Figure 8 shows the accuracy and retrieval cost of CCCF for different $h$. First, we can see that the retrieval cost of CCCF continuously drops as we decrease $h$, since high sparsity leads to fast computation. Second, we observe that the best recommendation performance is achieved at $h = 0.7 \sim 0.8$. For $h$ above this range, increasing the sparsity (decreasing $h$) makes CCCF more robust to overfitting. However, when the sparsity level is very high, the CCCF model may not be informative enough for prediction and thus suffers a performance drop.


Figure 8: Item recommendation performance and retrieval cost of CCCF with different $h$ values (controlling the sparsity of the user/item component weights).

4.2.4 Item Embeddings Visualization (RQ4). The key advantage of compositional codes is their stronger representation capability compared to binary codes. Therefore, we visualize the learned item embeddings on the Movielens 1M dataset, where items are movies. We use the item representations learned by DCF and CCCF as input to the visualization tool t-SNE [19]. As a result, each movie is mapped to a two-dimensional vector, so we can visualize each item embedding as a point in a two-dimensional space. For items labelled with different genres, we use different colors for the corresponding points. Thus, a good visualization result is one where points of the same color are close to each other. The visualization results are shown in Figure 9. We find that the result of DCF is unsatisfactory, since points belonging to different categories are mixed with each other. For CCCF, we can observe clear clusters of different categories. This again validates the much stronger representation power of compositional codes over traditional binary codes.

Figure 9: Visualization of the Movielens 1M dataset. Each point indicates one item embedding (movie). The color of a point indicates the genre of the movie.

5 RELATED WORK

As a pioneering work, Locality-Sensitive Hashing was adopted to generate hash codes for Google News readers based on their click history [4]. Following this work, random projection was applied to map user/item embeddings learned by matrix factorization into the Hamming space to obtain binary codes for users and items [7]. Similar to the idea of projection, Zhou et al. [35] generated binary codes from rotated continuous user/item representations by running Iterative Quantization. In order to derive more compact binary codes, a de-correlation constraint over different binary codes was imposed on the user/item continuous representations before rounding them to produce binary codes [18]. These works can be summarized as two independent stages: relaxed learning of user/item representations with some specific constraints, and subsequent binary quantization. However, such two-stage approaches suffer from a large quantization loss [30], so direct optimization of matrix factorization with discrete constraints was proposed. To derive compact yet informative binary codes, balanced and de-correlation constraints were further imposed [30]. To incorporate content information from users and items, content-aware matrix factorization and factorization machines with binary constraints were further proposed [12, 16]. For dealing with social information, a discrete social recommendation model was proposed in [15].

Recently, the idea of compositional codes has been explored for the compression of feature embeddings [1, 23, 24], which has become increasingly important for deploying large models on small mobile devices. In general, these methods compose the embedding vectors from a small set of basis vectors, with the selection of basis vectors governed by the hash code of the original symbol. In this way, compositional coding approaches maximize storage efficiency by eliminating the redundancy inherent in representing similar symbols with independent embeddings. In contrast, this work employs compositional codes to address the inner-product similarity search problem in recommender systems.

6 CONCLUSION AND FUTURE WORK
This work contributes a novel and much more effective framework called Compositional Coding for Collaborative Filtering (CCCF). The idea is to represent each user/item by multiple components of binary codes together with a sparse weight vector, where each element of the weight vector encodes the importance of the corresponding component of binary codes to the user/item. In contrast to standard binary codes, compositional codes significantly enrich the representation capability without sacrificing retrieval efficiency. CCCF thus enjoys the merits of both effectiveness and efficiency in recommendation. Extensive experiments demonstrate that CCCF not only outperforms existing hashing-based binary code learning algorithms in terms of recommendation accuracy, but also achieves a considerable speedup in retrieval efficiency over the state-of-the-art binary coding approaches. In the future, we will apply the compositional coding framework to other recommendation models, especially more generic feature-based models such as Factorization Machines. In addition, we are interested in employing CCCF with recently developed neural CF models to further advance the performance of item recommendation.
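To make the representation concrete, the following is a hedged sketch of how a preference score could be evaluated under such a compositional representation. This reflects our reading of the framework rather than the exact formulation: the score aggregates component-wise binary inner products weighted by the products of the corresponding user and item weights, and zero weights let whole components be skipped. All names are ours.

    import numpy as np

    def cccf_score(user_codes, user_w, item_codes, item_w):
        """Weighted sum of component-wise binary inner products (our reading)."""
        score = 0.0
        for g in range(len(user_w)):
            joint_w = user_w[g] * item_w[g]
            if joint_w == 0.0:   # sparsity: inactive components cost nothing
                continue
            score += joint_w * np.dot(user_codes[g], item_codes[g])
        return score

    G, r = 8, 16                 # G components of r-bit codes
    rng = np.random.default_rng(0)
    def rand_codes():
        return rng.choice([-1, 1], size=(G, r)).astype(np.int8)
    def rand_weights():
        w = rng.random(G)
        return np.where(w < 0.5, 0.0, w)   # roughly half the components inactive

    print(cccf_score(rand_codes(), rand_weights(), rand_codes(), rand_weights()))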

ACKNOWLEDGMENTS
This research is supported by the National Research Foundation Singapore under its AI Singapore Programme [AISG-RP-2018-001]. Xin Wang is supported by China Postdoctoral Science Foundation No. BX201700136.


REFERENCES
[1] Ting Chen, Martin Renqiang Min, and Yizhou Sun. 2018. Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations. arXiv preprint arXiv:1806.09464 (2018).
[2] Zhiyong Cheng, Ying Ding, Lei Zhu, and Mohan Kankanhalli. 2018. Aspect-aware latent factor model: Rating prediction with ratings and reviews. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 639–648.
[3] Zhiyong Cheng, Jialie Shen, Lei Zhu, Mohan S Kankanhalli, and Liqiang Nie. 2017. Exploiting Music Play Sequence for Music Recommendation. In IJCAI, Vol. 17. 3654–3660.
[4] Abhinandan S Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web. ACM, 271–280.
[5] Johan Håstad. 2001. Some optimal inapproximability results. Journal of the ACM (JACM) 48, 4 (2001), 798–859.
[6] Kalervo Järvelin and Jaana Kekäläinen. 2000. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 41–48.
[7] Alexandros Karatzoglou, Alex Smola, and Markus Weimer. 2010. Collaborative filtering on a budget. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 389–396.
[8] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
[9] Joonseok Lee, Samy Bengio, Seungyeon Kim, Guy Lebanon, and Yoram Singer. 2014. Local collaborative ranking. In Proceedings of the 23rd international conference on World wide web. ACM, 85–96.
[10] Joonseok Lee, Seungyeon Kim, Guy Lebanon, and Yoram Singer. 2013. Local low-rank matrix approximation. In International Conference on Machine Learning. 82–90.
[11] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1419–1428.
[12] Defu Lian, Rui Liu, Yong Ge, Kai Zheng, Xing Xie, and Longbing Cao. 2017. Discrete Content-aware Matrix Factorization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 325–334.
[13] Chenghao Liu, Steven CH Hoi, Peilin Zhao, Jianling Sun, and Ee-Peng Lim. 2016. Online adaptive passive-aggressive methods for non-negative matrix factorization and its applications. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 1161–1170.
[14] Chenghao Liu, Tao Jin, Steven CH Hoi, Peilin Zhao, and Jianling Sun. 2017. Collaborative topic regression for online recommender systems: an online and Bayesian approach. Machine Learning 106, 5 (2017), 651–670.
[15] Chenghao Liu, Xin Wang, Tao Lu, Wenwu Zhu, Jianling Sun, and Steven CH Hoi. 2019. Discrete Social Recommendation. In Thirty-Third AAAI Conference on Artificial Intelligence.
[16] Han Liu, Xiangnan He, Fuli Feng, Liqiang Nie, Rui Liu, and Hanwang Zhang. 2018. Discrete Factorization Machines for Fast Feature-based Recommendation. arXiv preprint arXiv:1805.02232 (2018).
[17] Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang. 2014. Discrete graph hashing. In Advances in Neural Information Processing Systems. 3419–3427.
[18] Xianglong Liu, Junfeng He, Cheng Deng, and Bo Lang. 2014. Collaborative hashing. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2139–2146.
[19] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
[20] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 452–461.
[21] Fumin Shen, Yadong Mu, Yang Yang, Wei Liu, Li Liu, Jingkuan Song, and Heng Tao Shen. 2017. Classification by Retrieval: Binarizing Data and Classifier. (2017).
[22] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. 2015. Supervised discrete hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 37–45.
[23] Raphael Shu and Hideki Nakayama. 2017. Compressing Word Embeddings via Deep Compositional Code Learning. arXiv preprint arXiv:1711.01068 (2017).
[24] Dan Tito Svenstrup, Jonas Hansen, and Ole Winther. 2017. Hash embeddings for efficient word representations. In Advances in Neural Information Processing Systems. 4928–4936.
[25] Matt P Wand and M Chris Jones. 1994. Kernel smoothing. Chapman and Hall/CRC.
[26] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. 2012. Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 12 (2012), 2393–2406.
[27] Xin Wang, Steven CH Hoi, Chenghao Liu, and Martin Ester. 2017. Interactive social recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 357–366.
[28] Xin Wang, Wei Lu, Martin Ester, Can Wang, and Chun Chen. 2016. Social recommendation with strong and weak ties. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 5–14.
[29] Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. 2003. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems. 521–528.
[30] Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. 2016. Discrete collaborative filtering. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 325–334.
[31] Yan Zhang, Defu Lian, and Guowu Yang. 2017. Discrete Personalized Ranking for Fast Collaborative Filtering from Implicit Feedback. In AAAI. 1669–1675.
[32] Yongfeng Zhang, Min Zhang, Yiqun Liu, and Shaoping Ma. 2013. Improve collaborative filtering through bordered block diagonal form matrices. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 313–322.
[33] Zhiwei Zhang, Qifan Wang, Lingyun Ruan, and Luo Si. 2014. Preference preserving hashing for efficient recommendation. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 183–192.
[34] Huan Zhao, Quanming Yao, James T Kwok, and Dik Lun Lee. 2017. Collaborative Filtering with Social Local Models. In Proceedings of the 16th International Conference on Data Mining. 645–654.
[35] Ke Zhou and Hongyuan Zha. 2012. Learning binary codes for collaborative filtering. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 498–506.