CATA++: A Collaborative Dual Attentive Autoencoder Method for Recommending Scientific Articles

Meshal Alfarhood a,∗, Jianlin Cheng a,b

a Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
b Informatics Institute, University of Missouri, Columbia, MO 65211, USA

ARTICLE INFO

Keywords: Recommender systems; Collaborative filtering; Matrix factorization; Sparsity problem; Attention mechanism; Autoencoder; Deep learning

ABSTRACT

Recommender systems have become an essential component of commercial websites. Collaborative filtering approaches, and Matrix Factorization (MF) techniques in particular, are widely used in recommender systems. However, the natural data sparsity problem limits their performance, since users generally interact with very few items in the system. Consequently, multiple hybrid models have recently been proposed to improve MF performance by incorporating additional contextual information into its learning process. Although these models improve the recommendation quality, two primary aspects leave room for further improvement: (1) multiple models focus only on some portion of the available contextual information and neglect other portions; (2) learning the feature space of the side contextual information needs to be further enhanced.

In this paper, we introduce a Collaborative Dual Attentive Autoencoder (CATA++) for recommending scientific articles. CATA++ utilizes an article's content and learns its latent space via two parallel autoencoders. We employ the attention mechanism to capture the most related parts of information in order to make more relevant recommendations. Extensive experiments on three real-world datasets show that our dual-way learning strategy significantly improves the MF performance in comparison with other state-of-the-art MF-based models under various experimental evaluations. The source code of our methods is available at: https://github.com/jianlin-cheng/CATA.

1. Introduction

The amount of data created in the last few years is overwhelming. The data volume grows exponentially from year to year, marking the era of big data. This motivates researchers to utilize this massive data to develop more practical and accurate solutions in most computer science domains. For instance, recommender systems (RSs) are a good solution for processing massive data in order to extract useful information (e.g., users' preferences) to help users with personalized decision making.

Scientific paper recommendation is a very common application of RSs. It helps scholars stay aware of related work in their research area. Generally, there are three common techniques for recommendations. First, collaborative filtering (CF) techniques are widely successful models that are applied to RSs. CF models typically depend on users' ratings, such that users with similar past ratings are more likely to agree on similar items in the future. Matrix Factorization (MF) has been one of the most popular CF techniques for many years due to its simplicity and effectiveness. MF has been widely used in the recommendation literature, such that many proposed models are enhanced versions of MF [11, 16, 20, 10, 7]. However, CF models generally rely only on users' past ratings in their learning process and do not consider other auxiliary data, which has since been shown to improve the quality of recommendations. For that reason, the performance of CF models decreases significantly when users have a limited amount of ratings data. This problem is also known as the data sparsity problem.

∗ Corresponding author: M. Alfarhood.

More recently, considerable effort has gone into including items' information along with users' ratings via topic modeling [26, 27, 15]. Collaborative Topic Regression (CTR) [26], for example, combines Probabilistic Matrix Factorization (PMF) and Latent Dirichlet Allocation (LDA) to utilize both users' ratings and items' reviews to learn their latent features. By doing that, the natural sparsity problem can be alleviated. These kinds of approaches are called hybrid models.

Simultaneously, deep learning (DL) has gained increasing attention in recent years due to how it enhances the way we process big data, and due to its capability of modeling complicated data such as texts and images. Deep learning has taken part in recommender systems research in the last few years and has surpassed traditional collaborative filtering methods. Restricted Boltzmann Machines (RBM) [21] is one of the first works that applies DL to CF recommendations. However, RBM is not deep enough to learn users' tastes from users' feedback data. Following that, Collaborative Deep Learning (CDL) [28] has been a very popular DL-based model, which extends the CTR model [26] by replacing the LDA topic model with a Stacked Denoising Autoencoder (SDAE). In addition, Deep Collaborative Filtering (DCF) [12] is a similar work that uses a marginalized Denoising Autoencoder (mDA) with PMF. Lately, Collaborative Variational Autoencoder (CVAE) [13] introduces a Variational Autoencoder (VAE) to handle items' contents. CVAE is evaluated against CTR and CDL, and the experimental results show that CVAE yields better predictions than both.

Meshal Alfarhood et al.: Preprint submitted to Elsevier Page 1 of 12

arXiv:2002.12277v2 [cs.LG] 15 May 2020

However, existing recommendation models, such as CDL and CVAE, have two major limitations. First, they assume that all features of the model's inputs contribute equally to the final prediction. Second, they focus only on some parts of the auxiliary item data and neglect other parts, which can also be utilized to improve recommendations.

Consequently, we introduce a Collaborative Dual Attentive Autoencoder (CATA++) for recommending scientific papers. We integrate the attention technique into our deep feature learning procedure to learn from an article's textual information (e.g., title, abstract, tags, and citations between papers) to enhance the recommendation quality. The features learned by each attentive autoencoder are then employed in the matrix factorization (MF) method for our final article suggestions. To show the effectiveness of our proposed model, we perform comprehensive experiments on three real-world datasets and compare our model with multiple recent MF-based models. The results show that our model can extract better features from the textual data than the baseline models. More importantly, CATA++ has a higher recommendation quality in the case of highly sparse data.

The main contributions of this work are summarized in the following points:

• We introduce CATA++, a Collaborative Dual Attentive Autoencoder, that has been evaluated on recommending scientific articles. We employ the attention technique in our model, such that only relevant parts of the data contribute more to the item content's representation. This representation helps in finding similarities between articles.

• We exploit more article content in our deep feature learning process. To the best of our knowledge, our model is the first model that utilizes all article content, including title, abstract, tags, and citations, together in one model by coupling two attentive autoencoder networks. The latent features learned by each network are then integrated into the matrix factorization method for our ultimate recommendations.

• We evaluate our model using three real-world datasets. We compare the performance of our proposed model with five baselines. CATA++ achieves superior performance when the data sparsity is extremely high.

The remainder of this paper is organized as follows. First, we explain some essential preliminaries in Section 2. Second, our model, CATA++, is described in depth in Section 3. Third, we discuss the experimental results thoroughly in Section 4. Lastly, we conclude our work in Section 5.

2. Preliminaries

The essential background needed to understand our model is explained in this section. We first describe Matrix Factorization (MF) for implicit feedback problems. After that, we present the original idea of the attention mechanism and its related work.

2.1. Matrix Factorization

Matrix Factorization (MF) [11] is a very popular technique among CF models. MF works by estimating the user-item matrix, $R \in \mathbb{R}^{n \times m}$, using two matrices, $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{m \times d}$, such that the product of $U$ and $V^T$ estimates the original matrix: $R \approx U \cdot V^T$. The value of $d$ corresponds to the dimension of the latent factors, such that $d \ll \min(n, m)$.

$U$ and $V$ are optimized by minimizing the difference between the actual ratings and the predicted ratings, as the following:

$$\mathcal{L} = \sum_{i,j \in R} \frac{I_{ij}}{2} (r_{ij} - u_i v_j^T)^2 + \frac{\lambda_u}{2} \|u_i\|^2 + \frac{\lambda_v}{2} \|v_j\|^2 \tag{1}$$

where $I_{ij}$ is an indicator variable that takes value 1 if user $i$ rates item $j$, and value 0 otherwise. Also, $\|\cdot\|$ denotes the Euclidean norm, and $\lambda_u$, $\lambda_v$ are two regularization parameters that prevent the values of $U$ and $V$ from being too large. This avoids model overfitting.
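As an illustration of Equation 1, the following NumPy sketch evaluates the objective on a toy problem. It is not taken from the paper's code: the function name `mf_loss`, the toy sizes, and the choice to sum each regularization term once over all users and all items are assumptions made for this example.

```python
import numpy as np

def mf_loss(R, I, U, V, lam_u, lam_v):
    """Equation 1: masked squared error plus L2 penalties on U and V.

    R: (n, m) ratings, I: (n, m) 0/1 indicator of observed ratings,
    U: (n, d) user factors, V: (m, d) item factors.
    Assumption: each penalty is summed once over all users / all items.
    """
    err = I * (R - U @ V.T)          # unobserved entries contribute nothing
    fit = 0.5 * np.sum(err ** 2)
    reg = 0.5 * lam_u * np.sum(U ** 2) + 0.5 * lam_v * np.sum(V ** 2)
    return fit + reg

# Toy example: n = 4 users, m = 5 items, d = 2 latent factors.
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 2))
V = rng.normal(size=(5, 2))
R = U @ V.T                          # ratings that the factors reproduce exactly
I = np.ones_like(R)                  # pretend every rating is observed

print(mf_loss(R, I, U, V, lam_u=0.0, lam_v=0.0))   # fit term vanishes here
print(mf_loss(R, I, U, V, lam_u=0.1, lam_v=0.1))   # only the penalties remain
```

With exact factors the fit term is zero, so the second value shows how the regularization terms alone pull the factors toward small norms.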

The previous objective function is designed for explicit data. However, explicit data (e.g., users' ratings) is not always available. As a result, implicit feedback (e.g., users' clicks) is utilized more frequently in recommendations. Thus, Weighted Regularized Matrix Factorization (WRMF) [7] modifies the objective function in Equation 1 to make it work for implicit data, as the following:

$$\mathcal{L} = \sum_{i,j \in R} \frac{c_{ij}}{2} (p_{ij} - u_i v_j^T)^2 + \frac{\lambda_u}{2} \|u_i\|^2 + \frac{\lambda_v}{2} \|v_j\|^2 \tag{2}$$

where $p_{ij}$ is the preference variable that takes value 1 if user $i$ interacts with item $j$, and value 0 otherwise. Also, a confidence variable ($c_{ij}$) is given to each user-item pair for implicit data, such that $c_{ij} = a$ when $p_{ij} = 1$, and $c_{ij} = b$ when $p_{ij} = 0$, where $a > b > 0$.

2.2. Attention mechanism

The idea of the attention mechanism is motivated by the human vision system and how our eyes focus on a specific part of an image, or on specific words in a sentence. In the same way, attention in deep learning can be described simply as a vector of weights that shows the importance of the input elements. Thus, the intuition behind attention is that not all parts of the input are equally significant, i.e., only a few parts are significant for the model. Attention was initially designed for the image classification task [17], and was then successfully applied in natural language processing (NLP) for the machine translation task [3], where the input and the output may have different lengths.

Attention has also been successfully applied in different recommendation tasks [9, 14, 25, 29, 23, 4]. For example, MPCN [25] is a multi-pointer co-attention network that takes user and item reviews as input and then extracts the most informative reviews, which contribute more to the predictions. Also, D-Attn [23] uses a convolutional neural network with dual attention (local and global attention) to learn the user and the item latent representations, similarly to the matrix factorization approach. Moreover, NAIS [4] employs an attention network to distinguish the items in a user profile that have more influence on the model predictions.

Figure 1: Collaborative Dual Attentive Autoencoder (CATA++) architecture. [Diagram: two parallel attentive autoencoders. One encodes $X_j$ into $Z_j$ and decodes it into $\hat{X}_j$; the other encodes $T_j$ into $Y_j$ and decodes it into $\hat{T}_j$. Each has a softmax attention step between the encoder and the decoder. The matrix factorization part links $U_i$ ($\lambda_u$), $V_j$ ($\lambda_v$), and $R_{ij}$ for $i = 1{:}n$ and $j = 1{:}m$.]

3. Methodology

Before describing our model thoroughly in this section, we first define the recommendation problem with implicit data and then illustrate our model.

3.1. Problem definition

The recommendation problem with implicit data is usually defined as the following:

$$r_{ij} = \begin{cases} 1, & \text{if user } i \text{ interacts with item } j \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where the ones refer to positive (observed) feedback, and the zeros refer to missing (unobserved) values. As a result, negative feedback is missing in this type of data. This problem is also called the one-class problem. Multiple approaches have been proposed to address this issue. One popular solution is to sample negative feedback from the missing values. Another solution, which we adopt in this paper, is to assign different confidence values to all user-item pairs, as we previously explained in the preliminaries section.
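The confidence-weighting approach can be sketched as follows: build the preference matrix of Equation 3 from a list of observed user-article pairs, assign confidence $a$ to observed pairs and $b$ to the rest, and evaluate the weighted objective of Equation 2. The function names and the toy interaction list are hypothetical; the defaults $a = 1$ and $b = 0.01$ are the values reported later in the experimental settings, and summing each regularization term once is an assumption.

```python
import numpy as np

def preference_and_confidence(pairs, n_users, n_items, a=1.0, b=0.01):
    """Build P (Equation 3) and C from observed (user, item) index pairs."""
    P = np.zeros((n_users, n_items))
    for i, j in pairs:
        P[i, j] = 1.0                      # observed interaction
    C = np.where(P == 1.0, a, b)           # a for observed pairs, b otherwise
    return P, C

def wrmf_loss(P, C, U, V, lam_u, lam_v):
    """Equation 2: confidence-weighted squared error plus L2 penalties."""
    fit = 0.5 * np.sum(C * (P - U @ V.T) ** 2)
    reg = 0.5 * lam_u * np.sum(U ** 2) + 0.5 * lam_v * np.sum(V ** 2)
    return fit + reg

# Hypothetical toy data: 3 users, 4 articles, 4 observed interactions.
pairs = [(0, 1), (0, 3), (1, 0), (2, 2)]
P, C = preference_and_confidence(pairs, n_users=3, n_items=4)
print(P)
print(C)
```

Because every user-item pair receives a weight, the unobserved zeros still influence training, only much less than the observed ones.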

Table 1: A summary of notations used in this paper.

| Notation | Meaning |
|---|---|
| $n$ | Number of users |
| $m$ | Number of articles |
| $d$ | Dimension of the latent factors |
| $R$ | User-article matrix |
| $U$ | Latent factors of users |
| $V$ | Latent factors of articles |
| $C$ | Confidence matrix |
| $P$ | Users preferences matrix |
| $X_j$ | Article's side information, i.e., title and abstract |
| $T_j$ | Article's side information, i.e., tags and citations |
| $\theta(X_j)$ | Mapping function for input $X_j$ of the autoencoder |
| $\gamma(T_j)$ | Mapping function for input $T_j$ of the autoencoder |
| $Z_j$ | Compressed representation of $X_j$ |
| $Y_j$ | Compressed representation of $T_j$ |
| $\lambda_u$ | Regularization parameter of users |
| $\lambda_v$ | Regularization parameter of articles |

Even though our model has been applied to a ranking prediction problem with implicit feedback data, it could also be used for a rating prediction problem with explicit feedback data by altering the final loss function.

In the following sections, we describe each part of our model separately and show how the recommendations are generated. Table 1 summarizes all the notations used in this paper and Figure 1 displays the architecture of our model.

3.2. The attentive autoencoder

The first part of our model is our deep, attentive autoencoder. Generally, an autoencoder [5] is a neural network that is trained in an unsupervised manner. Autoencoders are popular for dimensionality reduction, such that their input is compressed into a low-dimensional representation while the features' abstract meaning is preserved. The autoencoder's network is composed of two main parts: the encoder and the decoder. The encoder takes an input and squashes it into a latent space, $Z_j$. The encoding function can be written as: $Z_j = f(X_j)$. The decoder is then used to recreate the input, $\hat{X}_j$, from the latent space representation ($Z_j$). The decoder function can be written as: $\hat{X}_j = f(Z_j)$. The encoder and the decoder each usually consist of multiple hidden layers. The computations of the hidden layers are described as the following:

$$h^{(l)} = \sigma(h^{(l-1)} W^{(l)} + b^{(l)}) \tag{4}$$

where $(l)$ denotes the layer number, $W$ is the weight matrix, $b$ is the bias vector, and $\sigma$ is the Rectified Linear Unit (ReLU) activation function.

Our model takes two inputs from the article's data, $X_j = \{x_1, x_2, ..., x_s\}$ and $T_j = \{t_1, t_2, ..., t_g\}$, where $x_i$ and $t_i$ are real values in $[0, 1]$, $s$ is the vocabulary size of the articles' titles and abstracts, and $g$ is the vocabulary size of the articles' tags. In other words, the inputs of our attentive network are two normalized bag-of-words histograms that represent the vocabularies of our articles' textual data.

We apply the batch normalization (BN) [8] technique after each layer in our autoencoder to obtain a stable distribution of our output. Integrating BN into our model has a positive influence on our model accuracy, because it provides some regularization in training our neural network. Furthermore, we place an attention layer in the middle of our autoencoder, such that only the significant elements of the encoder's output are chosen to reconstruct the original input. To do that, we use the softmax(.) function to compute the probability distribution of the encoder's output, as the following:

$$f(z_c) = \frac{e^{z_c}}{\sum_d e^{z_d}} \tag{5}$$

After that, the output of the previous function and the encoder's output are multiplied element-wise to obtain the latent vector $Z_j$.

Finally, we choose the binary cross-entropy as our objective function for each of our autoencoders, as the following:

$$\mathcal{L} = -\sum_k \left( y_k \log(p_k) + (1 - y_k) \log(1 - p_k) \right) \tag{6}$$

where $y_k$ refers to the correct values, and $p_k$ refers to the predicted values.

The value of $p$ that minimizes the previous loss function is $p = y$, which makes it suitable for our autoencoder. To verify that, we take the derivative of the loss function with respect to $p$ and set it to zero:

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial p} &= -y\left(\frac{1}{p}\right) - (1 - y)\left(\frac{-1}{1 - p}\right) \\
-\frac{y}{p} + \frac{1 - y}{1 - p} &= 0 \\
-y(1 - p) + (1 - y)p &= 0 \\
-y + yp + p - yp &= 0 \\
-y + p &= 0 \\
p &= y
\end{aligned}
\tag{7}
$$
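The forward pass of one attentive autoencoder, as described in this section, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: batch normalization and training are omitted, the weights are random, the layer sizes (20-8-4-8-20) are toy values, and the sigmoid on the output layer is an assumption made so that the predictions lie in (0, 1) as the binary cross-entropy requires. The last lines check numerically that the loss of Equation 6 is smallest at $p = y$, in line with Equation 7.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(y, p, eps=1e-12):
    """Equation 6, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def attentive_forward(X, enc, dec):
    h = X
    for W, b in enc:                       # Equation 4, encoder layers
        h = relu(h @ W + b)
    Z = softmax(h) * h                     # Equation 5, then element-wise product
    out = Z
    for W, b in dec[:-1]:                  # Equation 4, hidden decoder layers
        out = relu(out @ W + b)
    W, b = dec[-1]
    return Z, sigmoid(out @ W + b)         # assumed sigmoid output layer

rng = np.random.default_rng(0)
sizes = [20, 8, 4]                         # toy layer sizes (assumption)
enc = [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
       for i, o in zip(sizes[:-1], sizes[1:])]
rev = sizes[::-1]
dec = [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
       for i, o in zip(rev[:-1], rev[1:])]

X = rng.random((3, 20))                    # three toy bag-of-words rows in [0, 1)
Z, X_hat = attentive_forward(X, enc, dec)
print(Z.shape, X_hat.shape, bce(X, X_hat))

# Numerical check of Equation 7 for a single target value y = 0.3.
y = np.array([0.3])
grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmin([bce(y, np.array([p])) for p in grid])]
print(best)
```

The attention step does not change the dimension of the latent vector; it only rescales each latent element by its softmax weight.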

3.3. Probabilistic matrix factorization

Probabilistic Matrix Factorization (PMF) [16] is a probabilistic linear model where the prior distributions of the users' preferences and the latent features are drawn from the Gaussian distribution. In our previous model, CATA [1], we train a single attentive autoencoder and incorporate its output into PMF. The objective function of CATA was defined as:

$$\mathcal{L} = \sum_{i,j \in R} \frac{c_{ij}}{2} (p_{ij} - u_i v_j^T)^2 + \frac{\lambda_u}{2} \|u_i\|^2 + \frac{\lambda_v}{2} \|v_j - \theta(X_j)\|^2 \tag{8}$$

where $\theta(X_j) = \text{Encoder}(X_j) = Z_j$, such that $\theta(X_j)$ works as the Gaussian prior information for $v_j$.

On the other hand, CATA++ exploits extra item content and trains on it via two separate, parallel attentive autoencoders. We use the outputs of the two networks together as the prior information of the items' latent factors in PMF. Therefore, the objective function of CATA++ becomes:

$$\mathcal{L} = \sum_{i,j \in R} \frac{c_{ij}}{2} (p_{ij} - u_i v_j^T)^2 + \frac{\lambda_u}{2} \|u_i\|^2 + \frac{\lambda_v}{2} \|v_j - (\theta(X_j) + \gamma(T_j))\|^2 \tag{9}$$

where $\gamma(T_j) = \text{Encoder}(T_j) = Y_j$.

Taking the partial derivative of $\mathcal{L}$ in Equation 9 with respect to $u_i$ and setting it to zero determines the values of the users' latent vectors that minimize the objective function, as follows:

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial u_i} &= -\sum_j c_{ij} (p_{ij} - u_i v_j^T) v_j + \lambda_u u_i \\
0 &= -C_i (P_i - u_i V^T) V + \lambda_u u_i \\
0 &= -C_i V P_i + C_i V u_i V^T + \lambda_u u_i \\
V C_i P_i &= u_i V C_i V^T + \lambda_u u_i \\
V C_i P_i &= u_i (V C_i V^T + \lambda_u I) \\
u_i &= V C_i P_i (V C_i V^T + \lambda_u I)^{-1} \\
u_i &= (V C_i V^T + \lambda_u I)^{-1} V C_i P_i
\end{aligned}
\tag{10}
$$

where $I$ is the identity matrix.


Similarly, taking the derivative of $\mathcal{L}$ with respect to $v_j$ leads to:

$$v_j = (U C_j U^T + \lambda_v I)^{-1} \left( U C_j P_j + \lambda_v (\theta(X_j) + \gamma(T_j)) \right) \tag{11}$$

Finally, we use the Alternating Least Squares (ALS) optimization method to update the values of $U$ and $V$. It works by iteratively optimizing the values of $U$ while the values of $V$ are fixed, and vice versa. This operation is repeated until the values of $U$ and $V$ converge.

3.4. Prediction

Once we finish training our model, the model's prediction scores are computed as the dot product of the latent factors of users and articles ($U$ and $V$). Specifically, each user's vector ($u_i$) is multiplied with all vectors in $V$, as: $\text{scores}_i = u_i V^T$. As a result, we have a vector of scores that represents the user's preferences. We then sort these scores in descending order, such that the top-$K$ articles based on those scores are recommended. We repeat this process for all users. The overall process of our approach is illustrated in Algorithm 1.

Algorithm 1: CATA++ algorithm

    pre-train first autoencoder with input X;
    pre-train second autoencoder with input T;
    Z ← θ(X);
    Y ← γ(T);
    U, V ← Initialize with random values;
    while <NOT converge> do
        for <each user i> do
            u_i ← update using Equation 10;
        end for
        for <each article j> do
            v_j ← update using Equation 11;
        end for
    end while
    for <each user i> do
        scores_i ← u_i V^T;
        sort(scores_i) in descending order;
    end for
    Evaluate the top-K recommendations;
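The ALS and scoring steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch, not the released implementation: the autoencoder outputs $Z$ and $Y$ are replaced by random stand-ins, a fixed number of iterations replaces the convergence test, $V$ is stored as an $m \times d$ matrix (so $V C_i V^T$ in Equation 10 becomes `V.T @ C_i @ V` here), and all function names are hypothetical. The defaults $\lambda_u = 10$ and $\lambda_v = 0.1$ are the values reported for CATA++ in Table 4.

```python
import numpy as np

def cata_pp_loss(P, C, U, V, prior, lam_u, lam_v):
    """Equation 9, with each regularization term summed once (assumption)."""
    fit = 0.5 * np.sum(C * (P - U @ V.T) ** 2)
    return (fit + 0.5 * lam_u * np.sum(U ** 2)
            + 0.5 * lam_v * np.sum((V - prior) ** 2))

def als_cata_pp(P, C, Z, Y, lam_u=10.0, lam_v=0.1, n_iters=20, seed=0):
    n, m = P.shape
    d = Z.shape[1]
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, d))
    V = rng.normal(scale=0.1, size=(m, d))
    prior = Z + Y                               # theta(X_j) + gamma(T_j)
    I_d = np.eye(d)
    for _ in range(n_iters):                    # fixed budget, not a convergence test
        for i in range(n):                      # Equation 10
            Ci = C[i]
            A = (V * Ci[:, None]).T @ V + lam_u * I_d
            U[i] = np.linalg.solve(A, V.T @ (Ci * P[i]))
        for j in range(m):                      # Equation 11
            Cj = C[:, j]
            A = (U * Cj[:, None]).T @ U + lam_v * I_d
            rhs = U.T @ (Cj * P[:, j]) + lam_v * prior[j]
            V[j] = np.linalg.solve(A, rhs)
    return U, V

def top_k(U, V, k):
    scores = U @ V.T                            # scores_i = u_i V^T for every user
    return np.argsort(-scores, axis=1)[:, :k]   # descending order, keep top k

# Toy data: 6 users, 8 articles, d = 4; Z and Y are random stand-ins.
rng = np.random.default_rng(1)
P = (rng.random((6, 8)) < 0.3).astype(float)
C = np.where(P == 1.0, 1.0, 0.01)
Z = rng.normal(scale=0.1, size=(8, 4))
Y = rng.normal(scale=0.1, size=(8, 4))

U, V = als_cata_pp(P, C, Z, Y)
recs = top_k(U, V, k=3)
print(recs)
print(cata_pp_loss(P, C, U, V, Z + Y, 10.0, 0.1))
```

Solving the $d \times d$ linear system with `np.linalg.solve` avoids forming the explicit inverse that appears in Equations 10 and 11.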

4. Experiments

This section presents comprehensive experiments that address the following research questions:

• RQ1: How does our model perform compared to the state-of-the-art models? We answer with quantitative and qualitative analysis.

• RQ2: Are both autoencoders (left and right) cooperating with each other to enhance recommendation performance?

• RQ3: What is the impact of tuning different hyper-parameters (e.g., dimension of the features' latent space, number of layers inside each encoder and decoder, and regularization terms $\lambda_u$ and $\lambda_v$) on the performance of our model?

Before answering the aforementioned research questions, we first describe the datasets, the evaluation metrics, and the baseline approaches against which we evaluate our model.

4.1. Datasets

We use three real-world, scientific article datasets to evaluate our model against the state-of-the-art models. All three datasets are gathered from the CiteULike website¹. CiteULike was a web service that let users create their own library of academic publications.

First, the Citeulike-a dataset, which is gathered by [26], has 5,551 users, 16,980 articles, 204,986 user-article interaction pairs, 46,391 tags, and 44,709 citations between articles. The tags are single-word keywords that are generated by CiteULike users when they add an article to their library. Citations between articles are taken from Google Scholar². The data sparsity of this dataset is considerably high, with only around 0.22% of the user-article matrix having interactions. Users have at least 10 articles in their library.

Second, the Citeulike-t dataset, which is gathered by [27], has 7,947 users, 25,975 articles, 134,860 user-article interaction pairs, 52,946 tags, and 32,565 citations between articles. This dataset is sparser than the first one, such that only 0.07% of the user-article matrix has interactions. Users have at least three articles in their library.

Third, the Citeulike-2004-2007 dataset is three times bigger than the previous ones with regard to the user-article matrix. The data values in this dataset are extracted between 11-04-2004 and 12-31-2007. It is gathered by [2] and it has 3,039 users, 210,137 articles, 284,960 user-article interaction pairs, and 75,721 tags. It is worth pointing out that citation data is not available in this dataset. This is the sparsest dataset in this experiment, with sparsity equal to 99.95%. Users have at least 10 articles in their library. On average, users have 94 articles in their library, and each article is added to only one user library. Also, this dataset poses a scalability challenge for recommender systems because of its size. More information about the datasets is shown in Table 2.
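The sparsity figures quoted above follow directly from the dataset counts: sparsity is one minus the number of interaction pairs divided by the size of the user-article matrix. The short sketch below reproduces that arithmetic; the dictionary and function names are only for illustration.

```python
# (users, articles, user-article pairs) as reported in the text.
datasets = {
    "Citeulike-a": (5551, 16980, 204986),
    "Citeulike-t": (7947, 25975, 134860),
    "Citeulike-2004-2007": (3039, 210137, 284960),
}

def sparsity_percent(n_users, n_articles, n_pairs):
    """Share of empty cells in the user-article matrix, in percent."""
    return 100.0 * (1.0 - n_pairs / (n_users * n_articles))

for name, (n, m, pairs) in datasets.items():
    print(f"{name}: sparsity {sparsity_percent(n, m, pairs):.3f}%, "
          f"{pairs / n:.1f} articles per user")
```

The same counts give the average library size, for example 284,960 pairs over 3,039 users for Citeulike-2004-2007, which matches the figure of roughly 94 articles per user.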

Figure 2 shows the ratio of articles that are added to five or fewer users' libraries. For instance, 15%, 77%, and 99% of the articles in Citeulike-a, Citeulike-t, and Citeulike-2004-2007, respectively, are added to five or fewer users' libraries. Moreover, only 1% of the articles in Citeulike-a are added to only one user library, while the rest of the articles are added to more than one. On the contrary, 13% and 77% of the articles in Citeulike-t and Citeulike-2004-2007, respectively, are added to only one user library. This shows how the sparseness of the data with regard to articles increases from one dataset to the next.

¹ www.citeulike.org
² https://scholar.google.com



Table 2: Description of CiteULike datasets.

| Dataset | #Users | #Articles | #Pairs | #Tags | #Citations | Sparsity% |
|---|---|---|---|---|---|---|
| Citeulike-a | 5,551 | 16,980 | 204,986 | 46,391 | 44,709 | 99.78% |
| Citeulike-t | 7,947 | 25,975 | 134,860 | 52,946 | 32,565 | 99.93% |
| Citeulike-2004-2007 | 3,039 | 210,137 | 284,960 | 75,721 | – | 99.95% |

We follow the same procedure as the state-of-the-art models [28, 26, 13] to preprocess our textual data. First, we combine the title and the abstract of each article together. Second, we remove the stop words and select the top-N unique words based on the TF-IDF measurement [22]. As a result, 8,000, 20,000, and 19,871 words are selected for Citeulike-a, Citeulike-t, and Citeulike-2004-2007, respectively, to form the bag-of-words histograms. The bag-of-words histograms are normalized into values between zero and one based on the vocabularies' occurrences. The average number of words per article after our text preprocessing is 67, 19, and 55 words in Citeulike-a, Citeulike-t, and Citeulike-2004-2007, respectively.
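The preprocessing steps above can be sketched in plain Python. Several details are assumptions made for this example, since the text does not specify them: the TF-IDF variant (term frequency times log inverse document frequency), ranking words by their maximum TF-IDF score over all articles, and normalizing each histogram by the largest count in that article. The documents are hypothetical token lists with stop words already removed.

```python
import math
from collections import Counter

def top_n_vocabulary(docs, n):
    """Select the n words with the highest TF-IDF score (max over documents)."""
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    best = {}
    for doc in docs:
        for w, c in Counter(doc).items():
            score = (c / len(doc)) * math.log(len(docs) / df[w])
            best[w] = max(best.get(w, 0.0), score)
    # Sort by score (descending), breaking ties alphabetically.
    return sorted(best, key=lambda w: (-best[w], w))[:n]

def normalized_bow(doc, vocab):
    """Bag-of-words histogram over vocab, scaled into [0, 1]."""
    counts = Counter(w for w in doc if w in vocab)
    peak = max(counts.values(), default=0)
    return [counts[w] / peak if peak else 0.0 for w in vocab]

# Hypothetical tokenized articles (title + abstract, stop words removed).
docs = [
    ["matrix", "factorization", "matrix"],
    ["attention", "autoencoder"],
    ["matrix", "attention"],
]
vocab = top_n_vocabulary(docs, n=3)
x = normalized_bow(docs[0], vocab)
print(vocab)
print(x)
```

Each resulting vector plays the role of one input row $X_j$ of the first autoencoder.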

Similarly, we preprocess the tags information, such that tags assigned to fewer than five articles are removed, and thus we get 7,386 and 8,311 tags in total for Citeulike-a and Citeulike-t, respectively. For the Citeulike-2004-2007 dataset, we only keep tags that are assigned to more than 10 articles, which results in 11,754 tags in total for this dataset. After that, we create a bag-of-words histogram matrix, $Q \in \mathbb{R}^{m \times g}$, to represent the article-tag relationship, with $m$ being the number of articles and $g$ being the number of tags. This matrix is filled with ones and zeros, such that:

$$q_{at} = \begin{cases} 1, & \text{if tag } t \text{ is assigned to article } a \\ 0, & \text{otherwise} \end{cases} \tag{12}$$
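The article-tag matrix of Equation 12 can be sketched as follows. The sketch also includes a citation step in which a citing article inherits the tags of the article it cites, copied from the original (pre-citation) matrix. The function name, the input layout (a dictionary of tag indices per article and a list of citing/cited index pairs), and the toy data are assumptions made for this example.

```python
import numpy as np

def build_tag_matrix(article_tags, citations, n_articles, n_tags):
    """Binary article-tag matrix Q with tags propagated along citations.

    article_tags: dict mapping article index -> set of tag indices.
    citations: list of (citing, cited) article index pairs.
    """
    Q = np.zeros((n_articles, n_tags), dtype=np.int8)
    for a, tags in article_tags.items():
        Q[a, list(tags)] = 1               # Equation 12
    base = Q.copy()                        # original matrix, before citations
    for citing, cited in citations:
        Q[citing] |= base[cited]           # copy the cited article's ones
    return Q

# Hypothetical toy data: 3 articles, 4 tags.
article_tags = {0: {0, 1}, 1: {2}, 2: set()}
citations = [(2, 0), (1, 2)]               # article 2 cites 0, article 1 cites 2
Q = build_tag_matrix(article_tags, citations, n_articles=3, n_tags=4)
print(Q)
```

Copying from the original matrix means tags are not propagated transitively: article 1 cites article 2 but does not receive the tags that article 2 inherited from article 0.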

Also, citations between articles are integrated into this matrix, such that if article $x$ cites article $y$, then all the ones in vector $q_y$ of the original matrix are copied into vector $q_x$. We do that to capture the article-article relationship.

4.2. Evaluation methodology

To generate our training and testing data, we follow the same procedure as the state-of-the-art models [13, 28, 27]. We generate the training and the testing data under two settings, i.e., sparse and dense. To make the sparse ($P = 1$) and the dense ($P = 10$) datasets, we select $P$ random articles from each user's library for the training data, and we use the remaining articles for the testing data. The data splitting is repeated four times. One split is used to run a validation experiment to fine-tune the hyper-parameters of each model, while the remaining three splits are used to report the average performance of each model.
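The splitting procedure can be sketched as follows. The function names, the toy libraries, and the use of one random seed per split are assumptions made for illustration; the sketch also assumes every library contains at least $P$ articles.

```python
import random

def split_user_library(library, P, rng):
    """Pick P random articles for training; the rest go to testing."""
    items = list(library)
    train = rng.sample(items, P)           # requires len(items) >= P
    chosen = set(train)
    test = [a for a in items if a not in chosen]
    return train, test

def make_split(libraries, P, seed):
    rng = random.Random(seed)
    return {u: split_user_library(lib, P, rng) for u, lib in libraries.items()}

# Hypothetical user libraries (user id -> article ids).
libraries = {0: [3, 8, 15, 21], 1: [1, 2, 5, 8, 13, 34]}

# Four splits under the sparse setting (P = 1): one for validation,
# three for reporting the average performance.
splits = [make_split(libraries, P=1, seed=s) for s in range(4)]
validation_split, report_splits = splits[0], splits[1:]
print(validation_split)
```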

For our evaluation metrics, we adopt recall and normalized Discounted Cumulative Gain (nDCG). Recall per user is computed as the following:

$$\text{recall@}K = \frac{|\text{Testing Articles} \cap K \text{ Recommended Articles}|}{|\text{Testing Articles}|} \tag{13}$$

Figure 2: Ratio of articles that are added to ≤ N users' libraries. [Plot: ratio of articles (0.0 to 1.0) versus N (1 to 5) for citeulike-a, citeulike-t, and citeulike-2004-2007.]

However, the recall metric does not measure the ranking quality within the top-K recommendations. Therefore, we use nDCG as well to show the ability of the model to recommend articles at the top of the ranking list. nDCG is computed as the following:

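$$\text{nDCG@}K = \frac{1}{|U|} \sum_{u=1}^{|U|} \frac{\text{DCG@}K}{\text{IDCG@}K} \tag{14}$$

such that:

$$
\begin{aligned}
\text{DCG@}K &= \sum_{i=1}^{K} \frac{\delta(i)}{\log_2(i + 1)} \\
\text{IDCG@}K &= \sum_{i=1}^{\min(R, K)} \frac{1}{\log_2(i + 1)}
\end{aligned}
\tag{15}
$$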

such that $|U|$ refers to the number of users, $i$ is the article rank, $R$ is the number of relevant articles, and $\delta(i)$ is an indicator that takes value 1 if the article at rank $i$ is relevant, and 0 otherwise.

4.3. Baselines

Table 3: Comparison between all models about which data is used in their model training.

| Approach | User-article matrix | Title | Abstract | Tags | Citations |
|---|---|---|---|---|---|
| POP | ✓ | – | – | – | – |
| CDL | ✓ | ✓ | ✓ | – | – |
| CVAE | ✓ | ✓ | ✓ | – | – |
| CVAE++ | ✓ | ✓ | ✓ | ✓ | ✓ |
| CATA | ✓ | ✓ | ✓ | – | – |
| CATA++ | ✓ | ✓ | ✓ | ✓ | ✓ |

(Title, Abstract, Tags, and Citations are the side information columns.)

We evaluate our approach against the following methods:

• POP: Popular predictor is a non-personalized recommender system. It recommends the most popular articles in the training set to all users. It is widely used as the baseline for personalized recommender system models.

• CDL: Collaborative Deep Learning (CDL) [28] is a probabilistic model that jointly models both ratings data and items data using a stacked denoising autoencoder (SDAE) and a probabilistic matrix factorization (PMF).

• CVAE: Collaborative Variational Autoencoder (CVAE) [13] is a similar approach to CDL [28]. However, it uses a variational autoencoder (VAE) instead of SDAE to incorporate item content into PMF.

• CVAE++: We modify the implementation of CVAE [13] to include two variational autoencoders to engage more side information in the model training, like what CATA++ does. As a result of adding another VAE to the model, we change the loss function accordingly, such that the loss of the item latent variable becomes: $\mathcal{L}(v) = \lambda_v \sum_j \|v_j - (z_j + y_j)\|_2^2$, where $z_j$ is the latent content variable of the first VAE, and $y_j$ is the latent content variable of the second VAE.

• CATA: Collaborative Attentive Autoencoder (CATA) [1] is our preliminary work that uses a single attentive autoencoder (AAE) to train on article content, i.e., title and abstract.

Table 3 clarifies which part of the article's data is involved in training each model. As the table shows, only CATA++ and CVAE++ use all the available information for training.

Table 4 reports the best values of $\lambda_u$ and $\lambda_v$ for CDL, CVAE, CVAE++, CATA, and CATA++ based on the validation experiment. We use a grid search over the values {0.01, 0.1, 1, 10, 100} to obtain the optimal values. Moreover, for CDL, we set $a = 1$, $b = 0.01$, $d = 50$, $\lambda_n = 1000$, and $\lambda_w = 0.0001$. Also, we use a 2-layer SDAE network architecture with the structure "#Vocabularies-200-50-200-#Vocabularies" to run their code on our datasets. Similarly, for CVAE and CVAE++, we set $a = 1$, $b = 0.01$, and $d = 50$. A three-layer VAE network architecture, which is similar to the structure reported in their paper, is used with the structure "#Vocabularies-200-100-50-100-200-#Vocabularies". Finally, for CATA and CATA++, we also set $a = 1$, $b = 0.01$, and $d = 50$. A four-layer AAE network architecture in the form of "#Vocabularies-400-200-100-50-100-200-400-#Vocabularies" is used to train our models.

4.4. Experimental results

In this section, we address the research questions outlined at the start of the experiments section.

4.4.1. RQ1

To show how our model performs, we run quantitative and qualitative comparisons. Figures 3 and 4 display the performance of the top-K recommendations using the sparse data in terms of recall and nDCG for all three datasets. Similarly, Figures 5 and 6 display the performance using the dense data in terms of recall and nDCG.
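The per-user recall@K and nDCG@K reported in these figures (Equations 13 to 15) can be sketched as follows. The function names and the toy ranking are hypothetical; the sketch assumes a non-empty test set for each user, and the per-user nDCG values would then be averaged over all users as in Equation 14.

```python
import math

def recall_at_k(recommended, test_items, k):
    """Equation 13 for one user."""
    test = set(test_items)
    hits = sum(1 for a in recommended[:k] if a in test)
    return hits / len(test)

def ndcg_at_k(recommended, test_items, k):
    """DCG@K / IDCG@K (Equation 15) for one user."""
    test = set(test_items)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, a in enumerate(recommended[:k], start=1)
              if a in test)
    idcg = sum(1.0 / math.log2(rank + 1)
               for rank in range(1, min(len(test), k) + 1))
    return dcg / idcg

# Hypothetical ranked recommendations and held-out articles for one user.
recommended = [7, 2, 9, 4]
test_items = [2, 4, 11]
print(recall_at_k(recommended, test_items, k=4))
print(ndcg_at_k(recommended, test_items, k=4))
```

Two rankings with the same hits have the same recall, while nDCG is higher for the one that places the hits nearer the top.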

First, the sparse cases in Figures 3 and 4 are more challenging for any recommendation model, due to the scarce feedback data for the model's training. For the sparse cases, CATA++ achieves superior performance relative to the other MF-based models in all datasets on both metrics. More importantly, CATA++ beats the best model among all the baselines, CVAE++, by a wide margin in the Citeulike-2004-2007 dataset, which is the sparsest and contains a huge number of articles. This supports the validity of our model for sparse data. Second, for the dense cases, CATA++ again beats the other models, as Figures 5 and 6 display. In fact, several of the existing methods perform well under this setting, but poorly when the sparsity is high. For example, CDL fails to beat POP in the Citeulike-t dataset under the sparse setting, and then easily beats POP under the dense setting, as Figures 3b and 5b show.

Consequently, this experiment validates the ability of our model to overcome the limitations mentioned at the beginning of this paper. For instance, among the five baseline models, CVAE++ has the best performance, which emphasizes the usefulness of involving more of the article's data. Also, the attentive autoencoder (AAE) can extract more useful information than the variational autoencoder (VAE) and the stacked denoising autoencoder (SDAE), as CATA outperforms CVAE and CDL, and CATA++ outperforms CVAE++ while utilizing the same data.

Table 5 shows the percentage of performance improve-ment of ourmodel, CATA++, over the best competitor among



Table 4Parameter settings for �u and �v for CDL, CVAE, CVAE++, CATA, and CATA++ basedon the validation experiment.

Approach Citeulike-a Citeulike-t Citeulike-2004-2007

Sparse Dense Sparse Dense Sparse Dense

�u �v �u �v �u �v �u �v �u �v �u �vCDL 0.01 10 0.01 10 0.01 10 0.01 10 0.01 10 0.01 10CVAE 0.1 10 1 10 0.1 10 0.1 10 0.1 10 0.1 10

CVAE++ 0.1 10 0.1 10 0.1 10 0.1 10 1 10 1 10CATA 10 0.1 10 0.1 10 0.1 10 0.1 10 0.1 10 0.1

CATA++ 10 0.1 10 0.1 10 0.1 10 0.1 10 0.1 10 0.1

[Figure 3, three panels: Recall (y-axis) versus K from 10 to 300 (x-axis) with P=1, for POP, CDL, CVAE, CVAE++, CATA, and CATA++. Panels: (a) Citeulike-a, (b) Citeulike-t, (c) Citeulike-2004-2007.]

Figure 3: The top-K recommendation performance based on the recall metric using the sparse cases for (a) Citeulike-a, (b) Citeulike-t, and (c) Citeulike-2004-2007 datasets.

all baselines. This percentage measures the increase in performance and is calculated as improv% = (p_our − p_sota) / p_sota × 100, where p_our is the performance of our model and p_sota is the performance of the best model among all baselines.
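The improvement formula can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation code; the function name `improvement_percent` and the sample scores are hypothetical and are not values from Table 5.

```python
def improvement_percent(p_our: float, p_sota: float) -> float:
    """Relative improvement of our score over the best baseline, in percent.

    Implements improv% = (p_our - p_sota) / p_sota * 100.
    """
    if p_sota == 0:
        raise ValueError("p_sota must be non-zero to compute a relative improvement")
    return (p_our - p_sota) / p_sota * 100.0


# Hypothetical scores chosen only to illustrate the arithmetic.
gain = improvement_percent(p_our=0.12, p_sota=0.10)
print(f"improvement: {gain:.2f}%")
```

A negative result would mean the baseline outperforms our model on that metric.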

In addition to the aforementioned quantitative comparisons, qualitative comparisons are reported in Table 6 to show the quality of recommendations using real examples. The table shows the top-10 recommendations generated by our model, CATA++, and by the strongest competing model, CVAE++, for one randomly selected user from the Citeulike-2004-2007 dataset under the sparse setting. With this case study, we seek deeper insight into how the recommendations of the two models differ. In the example, user 2214 has only one article in his or her training library, entitled "A collaborative filtering framework based on fuzzy association rules and multiple-level similarity". This example illustrates the sparsity problem well: the user has very limited feedback data. Based on the article's title, this user is probably interested in recommender systems and more

[Figure 4, three panels: nDCG (y-axis) versus K from 10 to 300 (x-axis) with P=1, for POP, CDL, CVAE, CVAE++, CATA, and CATA++. Panels: (a) Citeulike-a, (b) Citeulike-t, (c) Citeulike-2004-2007.]

Figure 4: The top-K recommendation performance based on the nDCG metric using the sparse cases for (a) Citeulike-a, (b) Citeulike-t, and (c) Citeulike-2004-2007 datasets.



[Figure 5, three panels: Recall (y-axis) versus K from 10 to 300 (x-axis) with P=10, for POP, CDL, CVAE, CVAE++, CATA, and CATA++. Panels: (a) Citeulike-a, (b) Citeulike-t, (c) Citeulike-2004-2007.]

Figure 5: The top-K recommendation performance based on the recall metric using the dense cases for (a) Citeulike-a, (b) Citeulike-t, and (c) Citeulike-2004-2007 datasets.

[Figure 6, three panels: nDCG (y-axis) versus K from 10 to 300 (x-axis), for POP, CDL, CVAE, CVAE++, CATA, and CATA++. Panels: (a) Citeulike-a, (b) Citeulike-t, (c) Citeulike-2004-2007.]

Figure 6: The top-K recommendation performance based on the nDCG metric using the dense cases for (a) Citeulike-a, (b) Citeulike-t, and (c) Citeulike-2004-2007 datasets.

specifically in collaborative filtering (CF). After analyzing the results of each model, we find that our model recommends more relevant articles than the other baseline. For instance, most of the top-10 recommendations from CATA++ are related to the user's interest, and four of the ten are in the user's library, so the accuracy in this example is 0.4. Even though CVAE++ generates relevant articles as well, it also recommends some irrelevant ones, such as recommended article #7, entitled "Optimizing search engines using clickthrough data", which is more about search engines than recommender systems (RSs). After examining multiple examples, we conclude that our model identifies users' preferences more accurately, especially when data is limited.
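The accuracy figure for this case study can be reproduced from the "In user library?" column of Table 6. The sketch below assumes that "accuracy" means the fraction of the ten recommended articles that are in the user's library; the helper name `fraction_in_library` is hypothetical, and the flags are copied from Table 6 in rank order.

```python
# "In user library?" flags from Table 6, rank 1 through rank 10 (True = Yes).
CATA_PP_FLAGS = [False, False, False, True, True, False, False, True, False, True]
CVAE_PP_FLAGS = [False, False, True, False, False, False, False, False, False, True]


def fraction_in_library(flags: list[bool]) -> float:
    """Fraction of recommended articles that appear in the user's library."""
    if not flags:
        raise ValueError("flags must contain at least one recommendation")
    return sum(flags) / len(flags)


print("CATA++:", fraction_in_library(CATA_PP_FLAGS))
print("CVAE++:", fraction_in_library(CVAE_PP_FLAGS))
```

Under this reading, CATA++ places four library articles in its top 10 and CVAE++ places two.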

4.4.2. RQ2

To examine whether the two autoencoders cooperate with each other in finding more similarities between users and items, we run multiple experiments to show how each autoencoder performs alone compared to how they perform together. In other words, we compare the performance of using both autoencoders in parallel (i.e., CATA++) against the performance of using only the right autoencoder (i.e., CATA), which leverages the articles' titles and abstracts, and against the performance of using only the left autoencoder, which leverages the articles' tags and citations if available. Figure 7 shows the overall results. As the figure shows, the dual-way strategy always has better results than

Table 5
The improvement percentage of our model's performance over the best competitor according to Recall@10, Recall@300, nDCG@10, and nDCG@300.

Dataset               Sparse                                       Dense
                      Recall@10  Recall@300  nDCG@10  nDCG@300     Recall@10  Recall@300  nDCG@10  nDCG@300
Citeulike-a           8.18%      –           10.61%   1.57%        4.89%      3.16%       –        3.01%
Citeulike-t           27.42%     4.75%       22.01%   8.55%        0.84%      3.49%       –        3.15%
Citeulike-2004-2007   58.36%     21.27%      20.29%   19.47%       12.88%     18.06%      –        7.49%



Table 6
A quality example of the top-10 recommendations using the sparse case of the Citeulike-2004-2007 dataset.

User ID: 2214
Articles in training set: A collaborative filtering framework based on fuzzy association rules and multiple-level similarity

CATA++ | In user library?
1. Item-based collaborative filtering recommendation algorithms | No
2. Combining collaborative filtering with personal agents for better recommendations | No
3. An accurate and scalable collaborative recommender | No
4. Google news personalization: scalable online collaborative filtering | Yes
5. Combining collaborative and content-based filtering using conceptual graphs | Yes
6. Link prediction approach to collaborative filtering | No
7. Slope one predictors for online rating-based collaborative filtering | No
8. Slope one predictors for online rating-based collaborative filtering | Yes
9. A decentralized CF approach based on cooperative agents | No
10. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible... | Yes

CVAE++ | In user library?
1. Combining collaborative filtering with personal agents for better recommendations | No
2. Explaining collaborative filtering recommendations | No
3. Google news personalization: scalable online collaborative filtering | Yes
4. Learning user interaction models for predicting web search result preferences | No
5. Item-based collaborative filtering recommendation algorithms | No
6. Enhancing digital libraries with TechLens+ | No
7. Optimizing search engines using clickthrough data | No
8. Context-sensitive information retrieval using implicit feedback | No
9. A new approach for combining content-based and collaborative filters | No
10. Combining collaborative and content-based filtering using conceptual graphs | Yes

using each autoencoder alone, except for one case in Figure 7c. In addition, the left and right autoencoders are competitive with each other: the right autoencoder is better than the left on the Citeulike-a dataset, while the left autoencoder is better than the right on the other two datasets. We conclude that, by coupling both autoencoders, our model is able to identify more similarities between users and items, which eventually leads to better recommendations.

4.4.3. RQ3

We conduct several experiments to examine how tuning some hyper-parameters influences the performance of our model: the dimension of the latent features, the number of hidden layers of the attentive autoencoder, and the two regularization parameters, λ_u and λ_v, used to learn the user/article latent features.

First, the dimension of the latent space used to report our results in the previous section is 50, i.e., each user and item latent feature, u_i and v_j, is a vector of size 50. We use the same number as the state-of-the-art approach, CVAE, in order to have fair comparisons. However, to see the impact of different dimension sizes, we repeat all of our experiments, changing the size to one of the following values: {25, 50, 100, 200, 400}. In other words, we set the size of the latent factors of PMF and the size of the bottleneck of the attentive autoencoder to one of these values. On average, we observe that our model has the best performance across all three datasets when the dimension size is 200, as Figure 8a shows. Generally, setting the latent space size between 100 and 200 is enough to obtain reasonable performance compared to the other values.

Second, a four-layer network is used to construct our AAE for the results reported previously. The four-layer network has a shape of "#Vocabularies-400-200-100-50-100-200-400-#Vocabularies". However, we again repeat all of the experiments with different numbers of layers, from two to five, such that each layer is half the size of the previous one. As Figure 8b shows, using fewer than three layers is not enough to learn the side information. Generally, three-layer and four-layer networks are good enough to train our model.
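The halving rule for the layer sizes can be sketched as a small helper. This is an illustration under a stated assumption: the text does not say whether the first hidden layer or the bottleneck is held fixed when the depth changes, so the sketch fixes the first hidden layer at 400 units, which reproduces the reported four-layer shape. The function name `autoencoder_shape` and the vocabulary size of 8000 are hypothetical.

```python
def autoencoder_shape(vocab_size: int, num_layers: int, first_hidden: int = 400) -> list[int]:
    """Layer widths of a symmetric autoencoder whose encoder halves at each layer.

    Assumption (not from the paper): the first hidden layer stays at
    `first_hidden` units, so the bottleneck shrinks as depth grows.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    # Encoder: first_hidden, first_hidden / 2, first_hidden / 4, ...
    encoder = [first_hidden // (2 ** i) for i in range(num_layers)]
    # Decoder mirrors the encoder, excluding the bottleneck itself.
    decoder = encoder[-2::-1]
    return [vocab_size] + encoder + decoder + [vocab_size]


# Hypothetical vocabulary size, used only to print concrete shapes.
for depth in range(2, 6):
    print(depth, "-".join(str(n) for n in autoencoder_shape(8000, depth)))
```

With four layers this yields the 400-200-100-50-100-200-400 hidden stack quoted above; the other depths follow the same halving pattern under the stated assumption.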

Third, we repeat the experiment again with different values of λ_u and λ_v from the following range: {0.01, 0.1, 1, 10, 100}. Figures 9a and 9c show the performance for the sparse data of the Citeulike-a and Citeulike-t datasets, respectively. In these two figures, a lower value of λ_v typically results in lower performance, meaning that the user feedback data is not enough and the model needs more article information. The same can be said of both scenarios of the Citeulike-2004-2007 dataset in Figures 9e and 9f. In Figure 9e, higher values of λ_u decrease the performance where user feedback is scarce. Even though Figure 9f shows the performance under the dense setting for the Citeulike-2004-2007 dataset, it still exemplifies sparsity with regard to articles, as we indicated before in Figure 2, where 80% of the articles are added to only one user library. On the other hand, where user feedback is sufficient, a higher value of λ_v results in lower performance, as Figures 9b and 9d show.

[Figure 7, six panels: Recall (y-axis) versus K from 10 to 300 (x-axis), for the Left, Right, and Dual autoencoder variants. Panels: (a) Citeulike-a, P=1; (b) Citeulike-a, P=10; (c) Citeulike-t, P=1; (d) Citeulike-t, P=10; (e) Citeulike-2004-2007, P=1; (f) Citeulike-2004-2007, P=10.]

Figure 7: The performance results of using the left autoencoder vs. the right autoencoder compared to the use of both autoencoders together for (a-b) Citeulike-a, (c-d) Citeulike-t, and (e-f) Citeulike-2004-2007 datasets.

[Figure 8, two panels: (a) Latent space dimension, Recall@100 (y-axis) versus d in {25, 50, 100, 200, 400} (x-axis); (b) Number of layers, Recall@100 (y-axis) versus #Layers from 2 to 5 (x-axis).]

Figure 8: The impact of hyper-parameter tuning on CATA++ performance for: (a) dimension of features' latent space, and (b) number of layers inside each encoder and decoder.

[Figure 9, six panels: Recall@50 over λ_v (x-axis, 0.01 to 100) and λ_u (y-axis). Panels: (a) Citeulike-a, P=1; (b) Citeulike-a, P=10; (c) Citeulike-t, P=1; (d) Citeulike-t, P=10; (e) Citeulike-2004-2007, P=1; (f) Citeulike-2004-2007, P=10.]

Figure 9: The impact of λ_u and λ_v on CATA++ performance for (a-b) Citeulike-a, (c-d) Citeulike-t, and (e-f) Citeulike-2004-2007 datasets.

5. Conclusion

In this paper, we alleviate the natural data sparsity problem in recommender systems by introducing a dual-way strategy that learns an item's textual information by coupling two parallel attentive autoencoders. The learned item features are then utilized in the learning process of matrix factorization (MF). We evaluate our model on the academic article recommendation task using three real-world datasets. The large gap in the experimental results validates the usefulness of exploiting more item information and the benefit of integrating the attention technique to find more relevant recommendations, thus boosting recommendation accuracy. As a result, our model, CATA++, outperforms multiple state-of-the-art MF-based methods according to several evaluation measurements. Furthermore, the performance of CATA++ improves the most where data sparsity is the highest.

For future work, new metric learning algorithms could be explored as a substitute for the MF technique, because the dot product in MF does not guarantee the triangle inequality [6]. For any three items x, y, and z, the triangle inequality holds when the distance between any one pair in the feature space is no greater than the sum of the distances of the other two pairs, such that d(x, y) ≤ d(x, z) + d(z, y). By doing so, user-user and item-item relationships might be captured more accurately.
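A small numeric check may make the triangle-inequality point concrete. The sketch below compares Euclidean distance with a negated dot product used as a stand-in dissimilarity; treating the negated dot product as a "distance" is an assumption made only for illustration and is not a construction taken from this paper or from [6]. The three 2-D points are hypothetical.

```python
import math


def euclidean(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Euclidean distance, a true metric."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def neg_dot(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Negated dot product: a higher dot product means a smaller 'distance'.

    Illustrative assumption only; this is not a metric.
    """
    return -sum(ai * bi for ai, bi in zip(a, b))


def satisfies_triangle(d, x, y, z) -> bool:
    """Check d(x, y) <= d(x, z) + d(z, y) for one triple of points."""
    return d(x, y) <= d(x, z) + d(z, y)


# Hypothetical item vectors in a 2-D feature space.
x, y, z = (1.0, 0.0), (-1.0, 0.0), (0.0, 0.0)

print("Euclidean satisfies the inequality:", satisfies_triangle(euclidean, x, y, z))
print("Negated dot product satisfies it:", satisfies_triangle(neg_dot, x, y, z))
```

For this triple the Euclidean distance passes the check while the dot-product dissimilarity fails it, which is the kind of inconsistency a metric-learning objective is meant to avoid.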



CRediT authorship contribution statement

Meshal Alfarhood: Conceptualization, Methodology, Software, Validation, Writing - original draft. Jianlin Cheng: Conceptualization, Supervision, Resources, Writing - review & editing.

References

[1] Alfarhood, M., Cheng, J., 2019. Collaborative attentive autoencoder for scientific article recommendation, in: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), IEEE. pp. 168–174.

[2] Alzogbi, A., 2018. Time-aware collaborative topic regression: Towards higher relevance in textual item recommendation, in: BIRNDL@SIGIR, pp. 10–23.

[3] Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[4] He, X., He, Z., Song, J., Liu, Z., Jiang, Y., Chua, T., 2018. Nais: Neural attentive item similarity model for recommendation. IEEE Transactions on Knowledge and Data Engineering 30, 2354–2366.

[5] Hinton, G.E., Salakhutdinov, R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504–507.

[6] Hsieh, C., Yang, L., Cui, Y., Lin, T., Belongie, S., Estrin, D., 2017. Collaborative metric learning, in: Proceedings of the 26th international conference on world wide web, International World Wide Web Conferences Steering Committee. pp. 193–201.

[7] Hu, Y., Koren, Y., Volinsky, C., 2008. Collaborative filtering for implicit feedback datasets, in: ICDM, Citeseer. pp. 263–272.

[8] Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

[9] Jhamb, Y., Ebesu, T., Fang, Y., 2018. Attentive contextual denoising autoencoder for recommendation, in: Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval, ACM. pp. 27–34.

[10] Jing, L., Wang, P., Yang, L., 2015. Sparse probabilistic matrix factorization by laplace distribution for collaborative filtering, in: Twenty-Fourth International Joint Conference on Artificial Intelligence.

[11] Koren, Y., Bell, R., Volinsky, C., 2009. Matrix factorization techniques for recommender systems. Computer, 30–37.

[12] Li, S., Kawale, J., Fu, Y., 2015. Deep collaborative filtering via marginalized denoising auto-encoder, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM. pp. 811–820.

[13] Li, X., She, J., 2017. Collaborative variational autoencoder for recommender systems, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM. pp. 305–314.

[14] Ma, C., Kang, P., Wu, B., Wang, Q., Liu, X., 2019. Gated attentive-autoencoder for content-aware recommendation, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, ACM. pp. 519–527.

[15] McAuley, J., Leskovec, J., 2013. Hidden factors and hidden topics: understanding rating dimensions with review text, in: Proceedings of the 7th ACM conference on Recommender systems, ACM. pp. 165–172.

[16] Mnih, A., Salakhutdinov, R., 2008. Probabilistic matrix factorization, in: Advances in neural information processing systems, pp. 1257–1264.

[17] Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K., 2014. Recurrent models of visual attention, in: Advances in neural information processing systems, pp. 2204–2212.

[18] Pan, R., Zhou, Y., Cao, B., Liu, N.N., Lukose, R., Scholz, M., Yang, Q., 2008. One-class collaborative filtering, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE. pp. 502–511.

[19] Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L., 2009. Bpr: Bayesian personalized ranking from implicit feedback, in: Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, AUAI Press. pp. 452–461.

[20] Salakhutdinov, R., Mnih, A., 2008. Bayesian probabilistic matrix factorization using markov chain monte carlo, in: Proceedings of the 25th international conference on Machine learning, ACM. pp. 880–887.

[21] Salakhutdinov, R., Mnih, A., Hinton, G., 2007. Restricted boltzmann machines for collaborative filtering, in: Proceedings of the 24th international conference on Machine learning, ACM. pp. 791–798.

[22] Salton, G., McGill, M., 1983. Introduction to modern information retrieval. McGraw-Hill, Inc.

[23] Seo, S., Huang, J., Yang, H., Liu, Y., 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, ACM. pp. 297–305.

[24] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958.

[25] Tay, Y., Luu, A., Hui, S., 2018. Multi-pointer co-attention networks for recommendation, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM. pp. 2309–2318.

[26] Wang, C., Blei, D., 2011. Collaborative topic modeling for recommending scientific articles, in: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM. pp. 448–456.

[27] Wang, H., Chen, B., Li, W., 2013. Collaborative topic regression with social regularization for tag recommendation, in: Twenty-Third International Joint Conference on Artificial Intelligence.

[28] Wang, H., Wang, N., Yeung, D., 2015. Collaborative deep learning for recommender systems, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM. pp. 1235–1244.

[29] Xiao, J., Ye, H., He, X., Zhang, H., Wu, F., Chua, T., 2017. Attentional factorization machines: learning the weight of feature interactions via attention networks, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, AAAI Press. pp. 3119–3125.

