An improved privacy-preserving DWT-based collaborative filtering scheme

Alper Bilge, Huseyin Polat
Computer Engineering Department, Anadolu University, 26470 Eskisehir, Turkey

Article info

Keywords: Privacy, DWT, Collaborative filtering, Accuracy, Performance, Preprocessing

Abstract

Collaborative filtering (CF) is one of the most efficient techniques to produce personalized recommendations and to deal with the information overload of modern times. Although CF techniques have immensely useful filtering capabilities, many CF systems have challenging problems like scalability, accuracy, and privacy. One approach to enhance scalability of such systems is to apply discrete wavelet transformation (DWT) techniques. DWT-based CF schemes significantly overcome the scalability problem. However, they fail to protect individual users' privacy. Moreover, although such schemes provide accurate predictions, the quality of the recommendations provided by DWT-based CF schemes can be further improved by applying some preprocessing methods. In this study, we propose privacy-preserving schemes to produce accurate predictions based on DWT efficiently without deeply exposing customers' privacy. We also recommend methods to order items before applying DWT to boost accuracy. After evaluating our schemes in terms of privacy and supplementary costs, we perform real data-based experiments to scrutinize the proposed schemes in terms of accuracy. Experimental results show that our privacy-preserving methods are able to offer recommendations with decent accuracy. Moreover, our outcomes show that our methods utilized to sort items improve accuracy. We finally provide some suggestions and explain future works.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

The rapid development of Web technologies and Internet facilities has led to a time of change in individuals' lifestyles.
With increasing ease of access to the Internet and attractive opportunities offered by online vendors, users have long since started to make use of online warehouse and amenity shopping services, which are much easier and more effective than discretionary shopping (McLaughlin, 2003). However, as those facilities pervade and people become familiar with them, two problems arise for online vendors. First, to be able to compete in the market, they need to attract the attention of customers, which gets harder since the number of those services increases constantly. Second, as more people turn to such services, it gets difficult to deal with the information arising from each individual. Providing personalized recommendations to help customers find products efficiently and rendering shopping a more entertaining activity help e-commerce sites handle both problems. Many stores and shopping malls employ such personalized prediction services. Most of those services suggest products of potential interest to a customer whenever a product is clicked, put into a shopping cart, or purchased. Some online vendors even provide personalized advertisements to customers according to their anticipated individual preferences while they are surfing the site.

Although there are various methods for providing personalized recommendations, one of the most utilized ones is called collaborative filtering (CF) (Ahn, Kang, & Lee, 2010), where past preferences of users are recorded and then used to determine the similarity among customers. For a given user, referred to as the active user (a), the most similar users are determined and, using their preferences on a given product, called the target item (q), a recommendation is estimated for a on q (\(p_{aq}\)).
Several studies show that CF performs well, and there are a number of successful adoptions in real-world applications (Ahmad & Khokhar, 2007; Kohrs & Merialdo, 2001; Symeonidis, Nanopoulos, Papadopoulos, & Manolopoulos, 2008; Ziqiang & Boqin, 2004).

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.09.094
Corresponding author. Tel.: +90 222 321 3550; fax: +90 222 323 9501. E-mail addresses: [email protected] (A. Bilge), [email protected] (H. Polat).
Expert Systems with Applications 39 (2012) 3841–3854. Contents lists available at SciVerse ScienceDirect. Journal homepage: www.elsevier.com/locate/eswa

CF relies on an \(n \times m\) matrix, commonly called the user-item matrix, which holds preferences of n users on m items, and all calculations are performed on this matrix. There are some common limitations of CF schemes (Chen, Wang, & Zhang, 2009; Merve Acilar & Arslan, 2009; Roh, Oh, & Han, 2003). First, as n and/or m increases over time, the amount of data to be handled increases correspondingly and causes scalability problems. Second, CF systems are supposed to produce accurate predictions to satisfy customers. Users usually prefer those sites with precise and dependable recommendations. Inaccurate referrals lead to angry customers. Third, and the most important challenge, is that many CF schemes fail to protect users' privacy. Due to privacy concerns, customers might decide to give


false data or refuse to give any data. CF services built on such insufficient and false data are more likely to be inaccurate and untrustworthy.

To overcome scalability problem, various approaches have been proposed. Chen et al. (2009) propose to use orthogonal nonnegative matrix tri-factorization and simultaneously clustering rows and columns of the user-item matrix to alleviate scalability problems. In order to improve scalability, Feng and Hui-you (2006) offer a genetic clustering method. Russell and Yoon (2008) apply discrete wavelet transformation (DWT) to recommender systems, where data are transformed and reduced significantly to decrease the amount of time for producing a prediction. Han, Xie, Yang, and Shen (2004) propose a distributed CF algorithm along with significance refinement and unanimous amplification for improved scalability. Al-Shamri and Bharadwaj (2008) propose applying fuzzy sets on a hybrid model constructed using advantages of memory-based approaches' accuracy and model-based approaches' scalability.

In addition to improving scalability, different techniques have been utilized to enhance the quality of the predictions. Bogdanova and Georgieva (2008) propose a new algorithm based on discovering the functional error-correcting dependencies in a dataset by using the fractal dimension for tackling accuracy problem of CF systems. Liu, Jia, Zhou, Sun, and Wang (2009) propose a modified CF method computing the similarity between congeneric nodes in bipartite networks substituting standard cosine similarity and achieve a significant improvement of algorithmic accuracy. Jeong, Lee, and Cho (2010) propose a similarity update method that uses an iterative message passing procedure and an accuracy metric in order to minimize the predictive accuracy error and to evenly distribute predicted ratings over true rating scales.

Privacy has been receiving increasing attention with increasing popularity of e-commerce. Users' preferences about various items and the products they bought are considered private. CF services collect buying information about their customers and build personal profiles, which might cause different privacy risks (Cranor, Reagle, & Ackerman, 1999; Jensen, Potts, & Jensen, 2005). Examples of such risks include unsolicited marketing, profiling users, price discrimination, being subject to government surveillance, data transferring, and so on. Due to privacy risks, customers hesitate to give their preferences about products they bought; or give false data. However, it is not an easy task to produce referrals from such inadequate and false data. To provide predictions with privacy, various privacy-preserving schemes have been proposed (Canny, 2002a, 2002b; Polat & Du, 2003).

Although Russell and Yoon (2008) propose a DWT-based CF scheme, they do not consider privacy preservation. Moreover, their scheme directly applies DWT to the naturally ordered data items. In this study, we investigate how to produce DWT-based referrals from masked data. In other words, we propose privacy-preserving schemes in which users mask their ratings and rated items before they send them to the data collector. We propose to use randomized perturbation techniques (RPT) to mask users' private data. However, such data disguising should still allow CF systems to provide DWT-based predictions with decent accuracy. We also propose methods to order items before applying DWT so that accuracy can be improved. We hypothesize that the ordering of items might affect the information losses in DWT. If we are able to order the items in such a way as to reduce information losses, the quality of the predictions can be improved without sacrificing scalability. We investigate our schemes in terms of privacy. Due to privacy measures, supplementary costs are inevitable. Thus, we also analyze the proposed schemes in terms of such costs. We finally perform real data-based experiments to assess accuracy changes and give future directions.

2. Related work

The success of CF algorithms mainly depends on data sparsity and size of the data they work on. The performance of any CF approach immediately decreases as data get sparse, which habitually happens in most web applications. Researchers have long been focusing on this issue and they have made significant progress in handling it. Examples of such approaches to alleviate sparsity include using artificial immune networks (Acilar & Arslan, 2009), factorization of user-item matrix via orthogonality properties (Chen et al., 2009), employing a random walk recommender (Yildirim & Krishnamoorthy, 2008), utilizing a hybrid approach consisting of both user- and item-based CF by defining a similarity weight (Zhang, Xiao, & Guo, 2008), applying back propagation neural networks to predict values of null ratings on users whose non-null ratings intersect the most (Feng & Hui-You, 2005), and a new algorithm to effectively predict missing ratings by setting a similarity threshold for both users and items (Ma, King, & Lyu, 2007).

While research on reducing sparsity has evolved considerably, scalability problem seems to be a more challenging issue due to the constantly increasing nature of data. Today's recommender systems can respond to even thousands of customers in a reasonable amount of time; however, this capability will not be enough for the future's rapidly evolving systems. Among approaches to overcome scalability problem, the most widely adopted one is dimensionality reduction, in which irrelevant users and/or items are eliminated to increase scalability. Zhang, Wang, Ford, Makedon, and Pearlman (2005) propose using singular value decomposition (SVD) as a dimensionality reduction method, and Vozalis and Margaritis (2007) implement demographic data along with SVD for enhancing scalability. Other approaches of dimensionality reduction employed to solve scalability are principal component analysis (PCA) and correspondence analysis (CA) (Kim & Yum, 2005; Vozalis, Markos, & Margaritis, 2009). Another approach to the scalability problem is clustering, in which neighborhood selection is performed according to previously clustered groups of customers. One example of these approaches combines case-based reasoning with self organizing map (SOM) clustering (Roh et al., 2003) and another example utilizes smoothing-based clustering (Xue et al., 2005). However, all these approaches are computationally expensive, which limits the benefits of the data reduction. Also, they tend to be ineffective with extremely large datasets. To overcome pitfalls of such approaches, DWT has been proposed to be employed in CF systems. DWT has a wide application area in science, engineering, mathematics, and computer science, such as audio and video compression (Kumari & Sadasivam, 2007), object recognition (Cheng & Chen, 2006), VLSI designs and architectures (Bum Pan & Park, 2002), and numerical analysis (Plas, Moor, & Waelkens, 2008).
Furthermore, DWT can be combined with previously discussed dimensionality reduction techniques in the data mining area, i.e., SVD (Zhao & Ye, 2009), PCA (Korrek & Nizam, 2010), and clustering (Yu & Kamarthi, 2010). The only application of DWT to recommender systems is performed by Russell and Yoon (2008) to reduce data efficiently and increase performance in large datasets without compromising accuracy of the system. Although their scheme enhances the performance of the system, it fails to preserve privacy and directly applies DWT to naturally ordered items.

Providing personalized referrals while preserving individuals' privacy has been an area of research because in today's infrastructure of web technologies, it is significantly important to make customer feel confident about her private information. Canny proposes two schemes for this purpose (Canny, 2002a, 2002b). In those schemes, users iteratively compute a public aggregate of their data adding vectors of user data employing homomorphic encryption to privately encrypt and decrypt vectors without exposing individual data. Canny's schemes are very appropriate for distributed systems such as peer-to-peer (P2P) or multi-party schemes. Another approach for preserving privacy of centralized systems with one central server conducting CF processes is to apply RPT. Polat and Du utilize RPT for solving the problem of privacy-preserving collaborative filtering (Polat & Du, 2003, 2006), and Yakut and Polat (2007) apply it to a constant time CF algorithm, Eigentaste, to achieve users' privacy. Kaleli and Polat (2007a, 2007b) propose applying randomized response techniques (RRT) with naïve Bayesian classifier (NBC) to produce private referrals. While it is proven that randomized approaches can handle privacy-preservation problem in CF systems, there are new approaches such as DWT-based data reduction to deal with other problems of CF, e.g. scalability, for which achieving individual privacy is not examined. In this study, we propose a privacy-preserving scheme to produce DWT-based referrals without greatly jeopardizing users' privacy. We investigate the consequences of applying RPT to DWT-based CF scheme in terms of accuracy and privacy. We also propose a new scheme to enhance the accuracy of the privacy-preserving scheme by ordering items to minimize information loss.

3. DWT-based collaborative filtering

Russell and Yoon (2008) use Haar wavelets to reduce the amount of items in order to achieve scalability in providing recommendation process. DWT-based CF approach basically divides the original user-item matrix into two components for each pair called approximation and detail coefficients, using the following equation:

\[ C_{appx} = \frac{u_j + u_{j+1}}{\sqrt{2}}, \qquad C_{dtl} = \frac{u_j - u_{j+1}}{\sqrt{2}}. \tag{1} \]

While approximation coefficient is a time domain expression of original data, detail coefficient is a frequency domain perspective of it, where both of them consist of half the size of items compared to original user-item matrix. However, there is a much difference between coefficients in the perspective of information held by them as approximation coefficient holding a very large portion of it. With this feature of DWT, consecutive transformations can be performed proceeding exclusively on approximation coefficient and hereby obtaining a very much compact form of original data with a sacrifice of negligible amount of information. Since DWT process halves the size of an original user-item matrix of size \(n \times m\), applying transformation k times, a compact form of size \(n \times (m/2^k)\) is obtained.

In Russell and Yoon (2008), such a model (the reduced data) is used to determine similarities among users. Similarity between two users is calculated using Pearson's correlation coefficient. In the sense of such an approach to determine similarities, whole process is accelerated remarkably since the number of items decreases much, which provides a level of scalability to the whole system. After determining similarities and choosing the best neighbors to a particular user, items with the highest predicted ratings are provided as a recommendation using adjusted weighted sum CF method. The approach presented in Russell and Yoon (2008) is shown to perform well in terms of both accuracy and performance. They prove by experiments that DWT is a solution to overcome scalability and accuracy problems of CF systems.

4. Privacy-preserving DWT-based CF scheme

In this section, we explain our proposed scheme in detail. We first define how to perturb individual users' data. Then, we explain how to reduce masked data using DWT. We finally elucidate online recommendation process from perturbed data.
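As background for the scheme, the Haar-based reduction of Section 3 can be sketched as follows. This is only an illustrative sketch of Eq. (1) applied repeatedly to the approximation coefficients; the function names, the toy ratings, and the zero-padding of an odd number of columns are assumptions made here and are not taken from Russell and Yoon (2008).

```python
import numpy as np

def haar_step(matrix):
    """One Haar level along the item axis, following Eq. (1)."""
    left, right = matrix[:, 0::2], matrix[:, 1::2]
    approx = (left + right) / np.sqrt(2.0)   # C_appx for each item pair
    detail = (left - right) / np.sqrt(2.0)   # C_dtl for each item pair
    return approx, detail

def reduce_items(matrix, levels):
    """Apply `levels` consecutive transforms, keeping only approximations."""
    reduced = np.asarray(matrix, dtype=float)
    for _ in range(levels):
        if reduced.shape[1] % 2:
            # Assumption: an odd number of columns is padded with zeros.
            pad = np.zeros((reduced.shape[0], 1))
            reduced = np.hstack([reduced, pad])
        reduced = haar_step(reduced)[0]
    return reduced

# Hypothetical dense 2 x 8 user-item matrix (n = 2 users, m = 8 items).
ratings = np.array([[4.0, 5.0, 1.0, 2.0, 3.0, 3.0, 5.0, 4.0],
                    [1.0, 2.0, 5.0, 4.0, 2.0, 1.0, 3.0, 3.0]])
reduced = reduce_items(ratings, levels=2)
print(reduced.shape)  # (2, 2), i.e. n x (m / 2^k) with k = 2
```

With k = 2, each remaining column is the sum of four neighbouring ratings divided by 2, which is why similar neighbouring items lose little information in the approximation.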

    4.1. Individual user privacy protection by randomization

The accuracy of recommender systems is directly related to the quality of collected data, which cannot be gathered from users without satisfying their privacy requirements. Today's recommender systems are bound to fail unless they provide a measure of privacy to customers. However, privacy and accuracy are conflicting goals because preserving privacy requires a level of distortion in original data, which in turn yields a decrease in accuracy of recommender systems. Since both of these measures should be provided by a recommender system, it is important to adjust the level of distortion such that a data holder does not need to reveal valuable and/or meaningful information and yet reliable and accurate recommendations can be produced from collected disguised data.

RPTs are methods that provide a sufficient level of privacy for users while helping recommender systems produce accurate and dependable recommendations. The basic idea behind RPT is to hide a value by adding a random number. These methods disguise a number, x, by adding to it a random value r drawn from some pre-defined distribution, so that x + r, rather than x, is stored in the database. Since the recommendation generation process is performed on aggregate data rather than individual data items, the data collector can still conduct certain computations on aggregated data although it cannot derive any meaningful information from individual disguised data items. The randomization process in such a system perturbs data items so that the server can only identify the range of masked data, which is adequately expanded to preserve privacy.
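The aggregate property described above can be illustrated with a small numerical sketch. The rating scale, the noise level, the sample size, and all variable names below are assumptions chosen for illustration only; they are not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical private values: 10,000 ratings on a 1..5 scale.
true_values = rng.integers(1, 6, size=10_000).astype(float)

# Zero-mean random numbers r; the database stores x + r instead of x.
noise = rng.normal(loc=0.0, scale=2.0, size=true_values.size)
disguised = true_values + noise

# An aggregate (the mean) is almost unchanged by the perturbation ...
mean_gap = abs(disguised.mean() - true_values.mean())
# ... while individual items are, on average, far from their true values.
item_gap = np.abs(disguised - true_values).mean()

print(f"gap between means: {mean_gap:.3f}")
print(f"mean per-item distortion: {item_gap:.3f}")
```

The per-item distortion stays on the order of the noise standard deviation, whereas the gap between the two means shrinks as more values are aggregated.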

With privacy as a concern, there are two aspects of preserving individuals' privacy in a recommender system: preventing the server or the data collector from learning the true preferences of users, and preventing it from learning which items are actually rated by users. In other words, users want to protect their true ratings about various items and hide their rated and/or unrated items. To achieve the first goal, users generate random numbers and add them to their votes so that the server cannot learn true ratings. To achieve the second goal, they also insert random numbers into some of their unrated item cells to mask their rated and/or unrated items.

In order to generate random numbers, users can either use a uniform or Gaussian distribution. In uniform distribution, they generate uniform random values in a range \([-\alpha, \alpha]\), where \(\alpha\) is a constant. Similarly, random numbers are generated using normal distribution with mean (\(\mu\)) being zero and standard deviation (\(\sigma\)) in Gaussian distribution. In short, users create random numbers with \(\mu = 0\) and \(\sigma\), where \(\alpha = \sqrt{3}\,\sigma\). After generating random numbers, all users (i) disguise their private ratings with random numbers by adding them to the true votes and (ii) mask their rated and/or unrated items by inserting random values into some of their unrated item cells before sending their vectors to the server.
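The relation between the two distributions can be checked with a short sketch: a uniform distribution on \([-\alpha, \alpha]\) has variance \(\alpha^2/3\), so choosing \(\alpha = \sqrt{3}\,\sigma\) gives both kinds of noise the same standard deviation. The helper name `draw_noise`, the value of sigma, and the sample size are hypothetical and used only for illustration.

```python
import numpy as np

def draw_noise(rng, sigma, size, use_uniform):
    """Zero-mean noise with standard deviation sigma from either distribution."""
    if use_uniform:
        alpha = np.sqrt(3.0) * sigma          # uniform on [-alpha, alpha]
        return rng.uniform(-alpha, alpha, size=size)
    return rng.normal(0.0, sigma, size=size)  # Gaussian with mu = 0

rng = np.random.default_rng(1)
sigma = 1.5  # hypothetical noise level
uniform_noise = draw_noise(rng, sigma, 200_000, use_uniform=True)
gaussian_noise = draw_noise(rng, sigma, 200_000, use_uniform=False)

# Both empirical standard deviations should be close to sigma.
print(uniform_noise.std(), gaussian_noise.std())
```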

To offer DWT-based recommendations, the server needs normalized ratings. Thus, before perturbing their data, users first normalize their votes by transforming them into z-scores. Although normalization changes rating scales unpredictably, it still does not prevent predicting actual ratings. This transformation process helps users disguise their private data with random numbers generated from a narrower interval. Next, each user u uniformly randomly chooses a \(\sigma_u\) value over the range \([0, \sigma_{\max}]\), where \(\sigma_{\max}\) is the maximum standard deviation value pre-determined by the server. Then, to determine the number of cells to be filled with random numbers, a \(\beta_u\) value is uniformly randomly selected over the range \([0, \beta_{\max}]\) by each user u, where \(\beta_{\max}\) stands for the maximum percentage of unrated item cells to be filled, pre-determined by the server. Finally, each user generates random numbers using a random number distribution based on the chosen values and disguises their ratings vectors. Notice that the range of random numbers and the number of unrated item cells to be filled with random numbers depend on the values of \(\sigma_{\max}\) and \(\beta_{\max}\), respectively. Their values affect privacy and accuracy, and their optimum values can be determined experimentally. Data disguising and collection steps can be enumerated as follows:

1. The server determines the values of parameters \(\sigma_{\max}\) and \(\beta_{\max}\); and lets each user know.

2. Each user u computes their ratings' average and standard deviation; and transforms the votes into z-scores.

3. Users then determine the number of unrated items (\(M_u\)) in their ratings vectors.

4. After uniformly randomly selecting \(\beta_u\), each user u uniformly randomly chooses \(\lfloor \beta_u M_u / 100 \rfloor\) unrated item cells to be filled.

5. Each user u selects \(\sigma_u\); and decides random number distribution. To do so, users create a random number over the range [0,1]. If it is less than 0.5, they utilize uniform distribution; otherwise, they use Gaussian distribution to generate random numbers.

6. Using the selected distribution, they generate \(R_u = m_u + \lfloor \beta_u M_u / 100 \rfloor\) random numbers (\(r_{uj}\) values for \(j = 1, 2, \ldots, R_u\)), where \(m_u\) is the number of user u's actual ratings.

7. Each user u finally disguises their z-scores and obtains \(z'_{uj} = z_{uj} + r_{uj}\); and fills the selected unrated item cells with corresponding random numbers.

8. After data disguising, each user sends her disguised vector to the server, which creates an \(n \times m\) disguised user-item matrix, \(D'\). The server can now start providing predictions to their customers based on \(D'\).
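Steps 2 to 7 above can be sketched on the user side as follows. This is a minimal sketch under stated assumptions: unrated cells are encoded as NaN, the population standard deviation is used for the z-scores and is assumed to be non-zero, and the function name, argument names, and toy ratings are hypothetical rather than part of the proposed scheme's specification.

```python
import numpy as np

def disguise_ratings(ratings, sigma_max, beta_max, rng):
    """Mask one user's ratings vector; NaN marks an unrated item (assumption)."""
    ratings = np.asarray(ratings, dtype=float)
    rated = ~np.isnan(ratings)
    m_u = int(rated.sum())

    # Step 2: z-score normalization (assumes a non-zero standard deviation).
    mean, std = ratings[rated].mean(), ratings[rated].std()
    z = np.full(ratings.shape, np.nan)
    z[rated] = (ratings[rated] - mean) / std

    # Steps 3-4: pick floor(beta_u * M_u / 100) unrated cells to fill.
    unrated_idx = np.flatnonzero(~rated)
    beta_u = rng.uniform(0.0, beta_max)
    n_fill = int(np.floor(beta_u * unrated_idx.size / 100.0))
    fill_idx = rng.choice(unrated_idx, size=n_fill, replace=False)

    # Step 5: choose sigma_u and the distribution by a fair coin flip.
    sigma_u = rng.uniform(0.0, sigma_max)
    total = m_u + n_fill  # Step 6: R_u random numbers
    if rng.uniform() < 0.5:
        alpha = np.sqrt(3.0) * sigma_u
        noise = rng.uniform(-alpha, alpha, size=total)
    else:
        noise = rng.normal(0.0, sigma_u, size=total)

    # Step 7: z'_uj = z_uj + r_uj, and fill the selected unrated cells.
    disguised = z.copy()
    disguised[rated] += noise[:m_u]
    disguised[fill_idx] = noise[m_u:]
    return disguised, mean, std

rng = np.random.default_rng(7)
nan = np.nan
ratings = [5, nan, 3, nan, nan, 1, 4, nan, nan, nan, 2, nan]
disguised, mean, std = disguise_ratings(ratings, sigma_max=1.0, beta_max=50.0, rng=rng)
print(np.round(disguised, 2))
```

The vector returned here is what the user would send in step 8; the user keeps the mean and standard deviation, which are needed later to de-normalize a received prediction.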

Any active user needs to send her available data and a query to the server in order to get a prediction. Therefore, her privacy should be preserved as well. Each active user a can similarly disguise her individual private data and then send the masked vector to the server.

4.2. Reducing \(D'\) using discrete wavelet transformation

As explained by Russell and Yoon (2008), a Haar-based DWT produces an approximation coefficient for each pair in the distribution being transformed. Given a user-item matrix, D, transformed recommender ratings matrix can be easily obtained. The server can transform the disguised user-item matrix, \(D'\), as follows, using the Haar-based DWT:

1. The server first fills empty cells with corresponding user averages to obtain a dense set. To do that, it should estimate such values from disguised data. The average z-scores can be estimated from perturbed data for each user, as follows:

\[ \bar{z}'_u = \frac{\sum_{j=1}^{R_u} z'_{uj}}{R_u} = \frac{\sum_{j=1}^{R_u} (z_{uj} + r_{uj})}{R_u} = \frac{\sum_{j=1}^{R_u} z_{uj}}{R_u} + \frac{\sum_{j=1}^{R_u} r_{uj}}{R_u} \approx \frac{\sum_{j=1}^{R_u} z_{uj}}{R_u}. \tag{2} \]

Since the random numbers are drawn from a distribution with \(\mu\) being zero, the expected value of their average is zero. With increasing \(R_u\) values, the observed value of random number averages will become closer to zero. Thus, the server can estimate averages of masked z-scores, fill the empty cells, and obtain a dense set.

2. It then estimates approximation coefficients using the Haar-based DWT. Such coefficients for each pair of masked z-scores can be estimated, as follows:

\[ z'^{\,appx}_{u} = \frac{z'_{uj} + z'_{u(j+1)}}{\sqrt{2}} = \frac{z_{uj} + r_{uj} + z_{u(j+1)} + r_{u(j+1)}}{\sqrt{2}} = \frac{z_{uj} + z_{u(j+1)}}{\sqrt{2}} + \frac{r_{uj} + r_{u(j+1)}}{\sqrt{2}} \approx \frac{z_{uj} + z_{u(j+1)}}{\sqrt{2}}. \tag{3} \]

For similar reasons, approximation coefficients can be estimated with decent accuracy from masked data. With increasing level of transformations, contributions of each pair of random numbers will converge to zero. Also notice that since users do not add or insert random numbers to all cells in their vectors, some pairs might have only one masked value, some may have undisguised pairs. In all cases, due to the nature of random number distribution and data transformation, approximation coefficient can be estimated from perturbed data.

3. After performing data transformation one to three levels, the server finally gets \(D''\) with size \(n \times m_t\) including transformed masked z-scores.

The computations so far are conducted off-line. The server can now estimate predictions for active users online.

4.3. Online recommendation estimation

Online computations include transforming a's data, neighborhood formation, and prediction estimation. The details of such computations are explained in the following.

A. Transformation of a's data: Since a normalizes her ratings and disguises her z-scores and rated and/or unrated items like other users do, the server can similarly fill her empty cells and transform her data by applying DWT.

B. Formation of neighborhood: This step contains two tasks: estimating similarities between a and each user in the database and selecting the best similar users. The similarity weights are estimated based on Pearson's correlation coefficient from masked and transformed data, as follows:

\[ w'_{au} = \sum_{j=1}^{m'} z'_{aj} z'_{uj} = \sum_{j=1}^{m'} (z_{aj} + r'_{aj})(z_{uj} + r'_{uj}) = \sum_{j=1}^{m'} z_{aj} z_{uj} + \sum_{j=1}^{m'} z_{aj} r'_{uj} + \sum_{j=1}^{m'} z_{uj} r'_{aj} + \sum_{j=1}^{m'} r'_{aj} r'_{uj}, \]
\[ w'_{au} \approx \sum_{j=1}^{m'} z_{aj} z_{uj}. \tag{4} \]

Notice that \(r'_a\) and \(r'_u\) values are noise data left after data transformation. Since the transformed data is dense and reduced from \(m\) to \(m'\), the summations are over \(m'\). The expected values of the last three summations in Eq. (4) are zero because random numbers are generated using a distribution with \(\mu\) being zero. Similarly, the expected means of the z-scores are zero, as well. Thus, the server can estimate the similarities between users from perturbed data with decent accuracy. After estimating similarities, the server can select the best N similar users as neighbors (top-N users) or filter out users whose similarity is beneath a pre-determined threshold value, as explained in Russell and Yoon (2008). Once the server forms a's neighborhood, it can now estimate a prediction for her.

C. Recommendation estimation: Since users including active users send normalized ratings to the server, \(p_{aq}\) (the prediction for a on item q) can be estimated, as follows:

\[ p'_{aq} = \bar{v}_a + \sigma_a \frac{\sum_{u=1}^{N'_a} w'_{au} z'_{uq}}{\sum_{u=1}^{N'_a} w'_{au}} = \bar{v}_a + \sigma_a P'_{aq}, \tag{5} \]

where \(N'_a\) shows the number of a's neighbors who rated q, and \(\bar{v}_a\) and \(\sigma_a\) represent a's mean vote and her ratings' standard deviation, respectively. The server estimates \(P'_{aq}\) only; and sends it to a. Once she receives it, she can de-normalize it and gets \(p'_{aq}\). The server can estimate \(P'_{aq}\), as follows:

    $$P'_{aq} = \frac{\sum_{u=1}^{N_a} (w_{au} + V_{au})(z_{uq} + r_{uq})}{\sum_{u=1}^{N_a} (w_{au} + V_{au})} = \frac{\sum_{u=1}^{N_a} w_{au} z_{uq} + \sum_{u=1}^{N_a} w_{au} r_{uq} + \sum_{u=1}^{N_a} V_{au} z_{uq} + \sum_{u=1}^{N_a} V_{au} r_{uq}}{\sum_{u=1}^{N_a} w_{au} + \sum_{u=1}^{N_a} V_{au}},$$

    $$P'_{aq} \approx \frac{\sum_{u=1}^{N_a} w_{au} z_{uq}}{\sum_{u=1}^{N_a} w_{au}}. \qquad (6)$$

    Due to the random numbers' distribution, the expected values of the last three summations in the numerator and the last summation in the denominator are zero. Thus, the equation holds. The contribution of both noise data in similarity weights and random values added to $a$'s z-scores is expected to be zero and converges to zero. In other words, the server is still able to estimate $P'_{aq}$ from disguised data without violating users' privacy. Although the users are free to use either uniform or Gaussian distribution with a randomly chosen $\sigma$ over a specified range, the server is still able to compute recommendations. Due to the z-scores, noise data, filled cells, and variably perturbing methods, the server cannot derive users' private data from received data.

    4.4. Reducing D with minimum information loss

    DWT is used to transform a discretely sampled continuous signal into its time and frequency components in signal coding applications. The reason why DWT is very successful in representing a signal in terms of its time and frequency components is that it produces coefficients of approximation and detail from two consecutive samples, which are relatively very close to each other in the spectrum due to being the samples of a continuous signal. However, this condition is not satisfied while applying DWT to recommender systems for the purpose of transforming and reducing a user-item matrix supposing items as discrete samples of a signal. In the transformation process, as explained previously, two consecutive samples produce an approximation coefficient through division of their sum by $\sqrt{2}$. Therefore, if two consecutive samples are far from each other in the spectrum, at the end of the transformation, those samples will neutralize each other's effect, causing information loss in original data. We hypothesize that if similar items are transformed, such transformations reduce information losses in the matrix. For this reason, we propose to order items of the matrix in such a way that the most similar items are transformed together. Due to the reduced information losses, the quality of the recommendations can then be improved, as well. Since the natural order is not necessarily suitable for transformation, we propose to merge the most similar items together to form non-neutralizing consecutive pairs in the matrix. Moreover, this process does not affect the online performance because such orderings can be conducted off-line. Thus, we can improve accuracy without sacrificing online performance while protecting users' privacy.

    Two items, $x$ and $y$, can be thought of as two vectors in the $n$ dimensional space and the similarity between them, $sim(x,y)$, is measured by means of the cosine of the angle between them, as follows (Sarwar, Karypis, Konstan, & Reidl, 2001):

    $$sim(x,y) = \cos(\vec{x},\vec{y}) = \frac{x \cdot y}{\|x\|_2 \times \|y\|_2}. \qquad (7)$$

    Before initiating the transformation process, similarity weights between all items are estimated using Eq. (7) off-line to form proper pairs so that information losses are minimized. We propose two different approaches to form the most similar items, as follows:

    (i) Single arrangement (SA): The most similar items are put adjacently once and DWT is repeatedly performed on the arranged matrix.

    (ii) Multiple arrangements (MA): After putting the most similar items adjacently, this arrangement process is repeated before all levels of transformation.
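
    The neutralization effect and the benefit of item ordering can be illustrated with a small sketch. It assumes a one-level Haar transform (pair sum and difference divided by $\sqrt{2}$) and a greedy pairing of the most similar items; the text does not specify how items are placed adjacently, so `greedy_pair_order` is a hypothetical strategy chosen only for illustration.

```python
import math

def haar_level(row):
    """One Haar level on an even-length list: (approximation, detail)."""
    s = math.sqrt(2)
    approx = [(row[i] + row[i + 1]) / s for i in range(0, len(row), 2)]
    detail = [(row[i] - row[i + 1]) / s for i in range(0, len(row), 2)]
    return approx, detail

def cosine(x, y):
    """Cosine similarity between two equal-length vectors, as in Eq. (7)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def greedy_pair_order(matrix):
    """Return an item order in which greedily chosen similar items are adjacent.

    matrix: list of user rows (users x items). Hypothetical pairing strategy.
    """
    m = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(m)]
    # All item pairs, from the most to the least similar.
    pairs = sorted(
        ((cosine(cols[i], cols[j]), i, j) for i in range(m) for j in range(i + 1, m)),
        reverse=True,
    )
    free, order = set(range(m)), []
    for _, i, j in pairs:
        if i in free and j in free:
            order += [i, j]      # place the pair side by side
            free -= {i, j}
    return order + sorted(free)  # a leftover item if m is odd
```

    With items 0 and 2 identical and items 1 and 3 their negatives, the natural order pairs opposite items and every approximation coefficient becomes zero, whereas the reordered matrix keeps all of the energy in the approximation coefficients. Under SA, such an ordering would be computed once; under MA, it would be recomputed on the reduced matrix before each transformation level.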

    Without privacy concerns, estimating similarities between items using the cosine similarity measure is an easy task. However, since users disguise their data or z-scores using RPTs, the data collector should be able to estimate such weights from masked data, as well. Note that each user masks her z-scores variably, i.e., using either Gaussian or uniform distributions with different $\sigma$ values and $\mu$ being 0. The server can estimate $sim(x,y)$ values from perturbed data, as explained in the following: In Eq. (7), the numerator includes a scalar dot product between two vectors. Suppose that the $x$ and $y$ vectors are masked by random vectors $X$ and $Y$, respectively, where $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$. Such a scalar dot product between two masked vectors can be estimated, as follows:

    $$x' \cdot y' = (x + X) \cdot (y + Y) = \sum_{j \in S} (x_j + X_j)(y_j + Y_j), \qquad (8)$$

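    where $S$ denotes the commonly filled cells. Since random numbers are generated using distributions with $\mu$ being 0, Eq. (8) can be approximated, as follows:

    $$\sum_{j \in S} (x_j + X_j)(y_j + Y_j) = \sum_{j \in S} x_j y_j + \sum_{j \in S} x_j Y_j + \sum_{j \in S} y_j X_j + \sum_{j \in S} X_j Y_j \approx \sum_{j \in S} x_j y_j. \qquad (9)$$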

    Eq. (9) holds because the expected values of the last three summations are zero due to the nature of the random numbers. The denominator consists of computing the vector lengths. The server can estimate the vector lengths from disguised data, as follows:

    $$\|x'\|_2 = \|x + X\|_2 = \sqrt{\sum_{j \in T} (x_j + X_j)^2}, \qquad (10)$$

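    where $T$ denotes the cells in which $x$ contains a value (and, with a slight abuse of notation, their number). Similarly, Eq. (10) can be written, as follows, without taking the square root:

    $$\sum_{j \in T} (x_j + X_j)^2 = \sum_{j \in T} x_j^2 + 2\sum_{j \in T} x_j X_j + \sum_{j \in T} X_j^2 \approx \sum_{j \in T} x_j^2 + \sum_{j \in T} X_j^2. \qquad (11)$$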

    Eq. (11) holds because random numbers are generated using distributions with $\mu$ being 0 and the expected value of the middle summation is zero. Although Eq. (11) holds, the server needs the first summation only to estimate the vector lengths. Thus, it somehow needs to get rid of the contribution of the random numbers, i.e., the last summation in Eq. (11). It can do so, as follows:

    $$\sum_{j \in T} (x_j + X_j)^2 - T\sigma_X^2 \approx \sum_{j \in T} x_j^2 + \sum_{j \in T} X_j^2 - T\sigma_X^2 \approx \sum_{j \in T} x_j^2, \qquad (12)$$

    where $\sigma_X$ is the standard deviation of the random numbers. After computing such values, the server can take the square root and estimate the vector lengths. It finally estimates the similarity weights between various items based on masked data.
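
    The estimation in Eqs. (8)-(12) can be checked numerically with a small sketch. The correction works because zero-mean noise satisfies $E[X_j^2] = \sigma_X^2$, so the noise energy is about $T\sigma_X^2$. The sketch simplifies the setting by assuming a single, known noise standard deviation per masked vector; the function names are hypothetical.

```python
import math

def plain_cosine(x, y):
    """Cosine similarity of two equal-length vectors, as in Eq. (7)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def masked_cosine(x_masked, y_masked, sigma_x, sigma_y):
    """Estimate cos(x, y) from masked vectors x' = x + X and y' = y + Y.

    sigma_x and sigma_y are the (assumed known) standard deviations of the
    zero-mean noise added to each vector.
    """
    t = len(x_masked)
    # Eq. (9): the masked dot product approximates the true dot product.
    dot = sum(a * b for a, b in zip(x_masked, y_masked))
    # Eq. (12): subtract the expected noise energy T * sigma^2.
    sq_x = sum(a * a for a in x_masked) - t * sigma_x ** 2
    sq_y = sum(b * b for b in y_masked) - t * sigma_y ** 2
    # For short vectors the corrected energies may become non-positive;
    # this sketch does not handle that case.
    return dot / math.sqrt(sq_x * sq_y)
```

    On long vectors, the corrected estimate should stay close to the true cosine, while the cosine computed directly on masked vectors is pulled toward zero because the noise inflates the vector lengths.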

    5. Performance and privacy analysis

    The proposed schemes should be analyzed in terms of both online and off-line costs like storage, communication, and computation. Notice that off-line costs are not as critical for overall performance as online costs. Also note that it is more imperative to analyze the supplementary costs due to our privacy-preserving measures. Our proposed schemes have no effect on storage costs. In other words, online and off-line storage costs do not change due to privacy-preserving measures. The storage cost is in the order of $O(nm)$ for both the original and our proposed schemes.

    In our scheme, the number of communications in the online and off-line phases remains the same compared to the original algorithm. Moreover, during online interactions, the amount of data to be transmitted remains the same, as well. However, due to data masking, the amount of data to be transferred off-line increases. Remember that each user $u$ inserts default values into $M_u\beta_u/100$ uniformly randomly chosen unrated item cells. Thus, the amount of data to be transferred during the off-line phase increases by $M_u\beta_u/100$ values for each user due to privacy concerns.

    Due to the nature of the data disguising schemes used in our proposed schemes, online computation costs do not change. In the proposed schemes, the server performs the same tasks as it does in the original algorithm (Russell & Yoon, 2008). Privacy-preserving measures do not cause any additional computation costs during the online phase. However, our proposed schemes introduce extra computation costs during the off-line phase. Although online computation costs are the critical ones for web-based applications, we still need to analyze off-line computation costs. Compared to the original algorithm, the item-ordering scheme requires extra off-line computation costs. In an $n \times m$ user-item matrix, a similarity computation between two items requires $n$ multiplication operations according to the cosine-based similarity calculation. Also, since that similarity computation is performed among all items, there will be $(m-1) + (m-2) + \cdots + 1 + 0 = m(m-1)/2$ similarity calculations for all $m$ items. Therefore, when the server utilizes the SA method to order items, this requires extra off-line computation costs in the order of $O(m^2 n)$. If it employs the MA approach for item ordering, it needs to apply item ordering three times, where the number of calculations decreases by a factor of four each time. Although the total number of similarity computations in the MA approach increases compared to the ones in SA, the number of similarity computations is still in the order of $O(m^2 n)$.

    The main focus of this study is to help users feel comfortable while providing their personal data for CF purposes by offering privacy-preserving measures. A privacy level with which users can feel confident requires preventing any part of the system from learning the true preferences of users on items, along with whether any of the ratings is an actual preference on an item. To achieve this goal, we propose both disguising preferences and inserting fake ratings into the user-item matrix in such a manner that the prediction production phases and the accuracy of the recommendations are affected minimally. There are two aspects to consider if we want to analyze the privacy level of the system according to the proposed disguising methods: (i) How can the server discriminate between actual and fake ratings? and (ii) How can it extract the real values of the ratings from their disguised states?

    5.1. Determining the actual or fake votes

    The server needs to distinguish between true and forged ratings from received disguised ratings vectors. Note that there are $m$ items to be rated. If we let $m_e$ denote the number of empty cells, $m_f$ denote the number of all ratings in the received disguised vector, $m_u$ denote the number of truly unrated items, and $m_r$ denote the number of true ratings in the original vector, then $m = m_e + m_f = m_u + m_r$. Notice that $m_e$ and $m_f$ are known by the server because they can be extracted from the disguised vector. To determine the true ratings or false votes, the server needs to guess $\beta_u$ and estimate $m_r$ and the set of rated and/or unrated items.

    (i) Guessing $\beta_u$: Although the server does not know $\beta_u$, it can guess it with a probability because its value depends on $\beta_{max}$ and $m_f$, which are known by the server. The probability of guessing the true value of $\beta_u$ for the server is 1-out-of-$\beta_{max}$ when the number of potential inserted fake ratings ($m_f - 2$, assuming that each user provides at least two votes) is greater than or equal to $\beta_{max}$ percent of $m$ because $m_f$ is too large to derive any extra information about the interval from which $\beta_u$ is selected. If $m_f - 2$ is less than $\beta_{max}$ percent of $m$, then the probability of guessing the true value of $\beta_u$ for the server is 1-out-of-$\lfloor (m_f - 2) \times 100/m \rfloor$ because such an amount of received ratings means that $\beta_u$ was chosen from a narrower range.

    (ii) Estimating $m_r$: After guessing $\beta_u$, the server can now estimate $m_r$, where $m_r = m - m_u = m - [100 m_e/(100 - \beta_u)]$.

    (iii) Guessing the set of actual ratings: After guessing $\beta_u$ and estimating $m_r$, the server can guess the set of true ratings or inserted fake votes. For the server, the probability of guessing the exact set of actual ratings in a disguised vector is 1-out-of-$C^{m_f}_{m_r}$, where $C^{f}_{g}$ represents the number of ways of picking $g$ unordered outcomes from $f$ possibilities.

    If we combine the aforementioned probabilities, the probability of guessing the set of actual ratings, referred to as $p_{m_r}$, in a disguised ratings vector is

    $$p_{m_r} = \text{1-out-of} \begin{cases} \beta_{max} \times C^{m_f}_{m_r} & \text{if } m_f - 2 \ge \beta_{max} m/100, \\ \lfloor (m_f - 2) \times 100/m \rfloor \times C^{m_f}_{m_r} & \text{otherwise.} \end{cases}$$

    5.2. Estimating real votes from masked ones

    The server tries to figure out the real values of the ratings from their masked z-scores. Therefore, we need to analyze the level of privacy users have and how those privacy levels change over differing levels of perturbation when RPT is employed for data disguising. The privacy measure that is used to quantify privacy due to RPT should indicate how precisely the original value of a rating can be estimated from its masked value. We used the measure proposed by Agrawal and Aggarwal (2001), which considers the distribution of original data along with the perturbing data, as well. The authors in Polat and Du (2005) also utilize the same measure to quantify privacy in the context of CF. Agrawal and Aggarwal (2001) propose $\Pi(X) = 2^{h(X)}$ as the measure, where $\Pi(X)$ denotes the length of the interval over which a uniformly distributed random variable has the same uncertainty as $X$, and $h$ denotes differential entropy. The authors then introduce the notion of conditional privacy, taking into account the additional information available in the perturbed values. Let $X$ be perturbed by $Y$, yielding $Z = X + Y$; then the average conditional privacy of $X$ given $Z$ is $\Pi(X|Z) = 2^{h(X|Z)}$, where $h(X|Z)$ denotes the conditional differential entropy of $X$ given $Z$. This motivates the metric $\mathcal{P}(X|Z) = 1 - 2^{-I(X;Z)}$, where $I(X;Z) = h(Z) - h(Z|X)$ is known as the mutual information between the random variables $X$ and $Z$. In conclusion, since $X$ and $Y$ are independent random variables, after disclosing $Z$, the privacy level of original data $X$ disguised by $Y$ is $\Pi(X|Z) = \Pi(X) \times [1 - \mathcal{P}(X|Z)]$. According to this metric, we compute privacy levels, $\Pi(X|Z)$, with respect to the varying levels of perturbation we employed in our experiments, as performed in Polat and Du (2005). We assume the distribution of original data ($X$) is Gaussian. Fig. 1 shows $\Pi(X|Z)$ values with varying levels of perturbation. As seen in the figure, with increasing level of perturbation (randomization), privacy levels increase while privacy losses become smaller. Also, it can be seen in the figure that although they show parallel trends, Gaussian perturbation performs slightly better than uniform perturbation. Other challenges for the server in estimating real votes are guessing which distribution each user $u$ utilizes to generate random numbers, estimating the value of $\sigma_u$, which is randomly chosen from the interval $[0, \sigma_{max}]$, and deducing the mean and the standard deviation of each user's original ratings.
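
    As a rough illustration of the privacy measure above, the following sketch evaluates it in the one case with a simple closed form: Gaussian original data perturbed by independent Gaussian noise. The closed-form entropies and the choice $\sigma_x = 1$ for z-scores are assumptions of this sketch, and the uniform perturbation case is not covered.

```python
import math

def privacy_level_gaussian(sigma_x, sigma_y):
    """Pi(X|Z) = 2**h(X|Z) for independent Gaussian X and Y, with Z = X + Y."""
    # Conditional variance of X given Z when X and Y are Gaussian.
    cond_var = (sigma_x ** 2 * sigma_y ** 2) / (sigma_x ** 2 + sigma_y ** 2)
    # Differential entropy in bits of a Gaussian with that variance.
    h_bits = 0.5 * math.log2(2 * math.pi * math.e * cond_var)
    return 2 ** h_bits

def privacy_loss_gaussian(sigma_x, sigma_y):
    """1 - 2**(-I(X;Z)), where I(X;Z) = h(Z) - h(Z|X) in bits."""
    mutual_info = 0.5 * math.log2(1 + sigma_x ** 2 / sigma_y ** 2)
    return 1 - 2 ** (-mutual_info)
```

    Under these assumptions, the privacy level grows and the loss shrinks as the noise standard deviation increases, and the two functions satisfy $\Pi(X|Z) = \Pi(X) \times [1 - \mathcal{P}(X|Z)]$ with $\Pi(X) = \sqrt{2\pi e}\,\sigma_x$.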

    6. Experiments

    Our privacy-preserving schemes certainly have effects on accuracy. We propose to perturb users' ratings by adding randomness for hiding actual preferences of individuals and suggest inserting fake ratings to prevent the server from learning rated items. These approaches definitely cause a decrease in accuracy due to the conflicting goals between privacy and accuracy. We employ an item-ordering approach to minimize information loss due to DWT on the user-item matrix. Although privacy-preserving schemes cause a decrease in accuracy level, the item-ordering scheme might reverse this effect and it may improve accuracy further compared to our privacy-preserving scheme. As explained in the following, we carry out several experiments on real data to examine how these schemes affect accuracy.

    Fig. 1. Privacy levels vs. level of perturbation. [Plot: privacy levels $\Pi(X|Z)$ against the standard deviation of perturbing data ($\sigma$), for Gaussian and uniform distributions.]

    We conducted several trials with varying parameters to show how our privacy-preserving scheme affects the accuracy of DWT-based recommendations. We also performed experiments to demonstrate how employing item ordering methods affects the accuracy of predictions on our proposed privacy-preserving scheme. Experiments were performed on two variations of the well-known publicly available data set collected by GroupLens at the University of Minnesota (www.grouplens.org). Those variations are called MovieLens Public (MLP) and MovieLens Million (MLM) data sets. Detailed information about them is given in Table 1.

    Table 1. Descriptions of data sets.

    Name  Item   n     m     Range  Type      Total votes  Density (%)
    MLP   Movie  943   1682  (1,5)  Discrete  100,000      6.30
    MLM   Movie  6040  3952  (1,5)  Discrete  1,000,000    4.25

    Evaluation of CF systems in terms of accuracy can be performed using several metrics measuring the accuracy of the system statistically. We applied two different metrics to evaluate our schemes in terms of accuracy: root mean square error (RMSE) and mean absolute error (MAE). While RMSE measures the deviation of differences between predicted and actual ratings, MAE assesses the average of those differences. The smaller the RMSE and the MAE values, the better the results are.
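
    The two accuracy metrics can be written down in a few lines. The sketch below uses the standard definitions of MAE and RMSE; the function names and the example values are illustrative and do not come from the experiments.

```python
import math

def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```

    For predictions (3, 4, 5) against actual ratings (2, 4, 3), the absolute errors are 1, 0, and 2, so the MAE is 1 and the RMSE is $\sqrt{5/3}$; RMSE penalizes the larger error more heavily.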

    For each experiment, we used uniformly randomly chosen train and test sets. For test purposes, we uniformly randomly chose 40% of all available actual ratings. We used those users who provided at least 20 and 40 ratings from MLP and MLM, respectively. Each time a rating is taken into consideration, the user to whom that rating belongs is treated as the active user and the remaining users are treated as train users. Data are transformed for three levels using DWT, as performed by Russell and Yoon (2008).
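
    One possible reading of this protocol is sketched below: users with too few ratings are dropped, and a uniformly random 40% of the remaining ratings is held out for testing. The tuple layout, the function name `split_ratings`, and the seeding are assumptions made for the sketch; the handling of active and train users is not shown.

```python
import random
from collections import defaultdict

def split_ratings(ratings, min_ratings, test_fraction=0.4, seed=0):
    """Split (user, item, rating) tuples into kept and held-out ratings.

    Only users with at least min_ratings ratings are considered.
    """
    per_user = defaultdict(list)
    for r in ratings:
        per_user[r[0]].append(r)
    eligible = [r for rs in per_user.values() if len(rs) >= min_ratings for r in rs]
    rng = random.Random(seed)
    # Uniformly random choice of the held-out (test) ratings.
    n_test = round(test_fraction * len(eligible))
    test_idx = set(rng.sample(range(len(eligible)), n_test))
    kept = [r for i, r in enumerate(eligible) if i not in test_idx]
    held_out = [r for i, r in enumerate(eligible) if i in test_idx]
    return kept, held_out
```

    For MLP, `min_ratings` would be 20, and for MLM it would be 40, following the description above.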

    We conducted trials to assess our privacy-preserving scheme. The details of such trials are as follows: There are various parameters in our proposed scheme to preserve privacy like $n$, BestN, $\beta_{max}$, $\sigma_{max}$, level of transformation, and type of random number distribution. The authors in Russell and Yoon (2008) determined three-level transformation as the optimum one, so we also transformed data three levels and did not run trials to determine its optimum value. Similarly, uniform and Gaussian distributions give very similar results in privacy-preserving collaborative schemes, as shown by Polat and Du (2003). Thus, we did not conduct trials to show how accuracy changes with varying random number distributions. We assumed that half of the users utilize Gaussian distribution while the remaining ones use uniform distribution to generate random numbers. Since accuracy and privacy are conflicting goals, it is important to fine-tune privacy related parameters to make sure that privacy of individuals is unquestionably protected and accuracy is at its maximum level. While performing tests to see the effects of a parameter, we held all other parameters constant at a pre-determined value and gradually changed the value of the parameter in question. Also, to obtain a baseline to compare our privacy-preserving scheme's performance, we performed trials using the original algorithm explained in Russell and Yoon (2008) without applying privacy-preserving schemes. The details of the conducted experiments are as follows:

    6.1. Number of users (n)

    We conducted trials to see the effects of varying $n$ values in the system on accuracy. For this purpose, we uniformly randomly selected 125, 250, 500, and 1000 (943) users from MLM and MLP, respectively. We kept other parameters constant at BestN = 25, $\beta_{max}$ = 25, and $\sigma_{max}$ = 2. Due to randomization, we repeated experiments 100 times. After comparing predictions on our scheme with the ones on original data, we computed overall averages of the errors. We estimated MAE and RMSE values. Since they show similar trends, we displayed MAEs only in Figs. 2 and 3 for MLP and MLM, respectively.

    Without privacy concerns, accuracy becomes better for increasing $n$ values for both data sets, as seen from Figs. 2 and 3. Similarly, in general, as the number of users contributing to the recommendation process increases, the quality of the recommendations improves, i.e., the average MAE of the system decreases, with privacy concerns for both data sets. In our privacy-preserving scheme, however, for both data sets, there is an unexpected rise in MAE values while $n$ increases from 125 to 250 users (from 0.8593 to 0.8695 for MLP and from 0.8207 to 0.8335 for MLM), which happens because 250 users might be an improper amount to choose the best neighbors. Also, for the MLM data set, there is a slight decrease on accuracy (from 0.8249 to 0.8245) in the transition from 500 to 1,000 users. When there are large numbers of users like 500 or more, accuracy becomes stable for both the with-privacy and the without-privacy cases. Although accuracy becomes worse due to privacy concerns, as expected, such losses are acceptable and it is still possible to provide referrals with decent accuracy with privacy concerns.

    6.2. Number of neighbors (BestN)

    The number of best neighbors whose data are used in the recommendation process is an important parameter. Thus, its effects on accuracy should be investigated. For this purpose, we ran various experiments while varying BestN from 10 to 200. In other words, 10, 25, 50, 100, and 200 neighbors are selected to collaborate in the prediction producing step according to previously calculated similarity values among users in reduced user-item matrices. We kept other parameters constant, where $n$ = 500, $\beta_{max}$ = 25, and $\sigma_{max}$ = 2. Similarly, due to randomization, we repeated experiments 100 times. After comparing predictions on our scheme with the ones on original data, we computed overall averages of the errors. We estimated MAE values and displayed them in Figs. 4 and 5 for MLP and MLM, respectively.

    Fig. 2. MAE values for varying n values (MLP). [Plot: MAEs against number of users (n), without privacy and with privacy.]

    Fig. 3. MAE values for varying n values (MLM). [Plot: MAEs against number of users (n), without privacy and with privacy.]

    Without privacy concerns, 10 and 50 neighbors provide the best results for MLP and MLM, respectively, as seen from the figures. With increasing number of neighbors from 10 to 100, accuracy becomes stable for MLM. For neighbors larger than 100, the quality of the predictions becomes worse. For MLP, accuracy decreases with increasing number of neighbors up to 100, and it becomes stable after that. Trials to determine the optimum BestN values show that incorporating 25 neighbors for MLP with an MAE of 0.8695 produces the best results. Similarly, for MLM, 10 neighbors with an MAE of 0.8308 and 25 neighbors with an MAE of 0.8335 achieve the finest outcomes. It is seen from the figures that it would not be reasonable to collaborate with more than 25 neighbors because accuracy gets worse with more than 25 neighbors with privacy concerns.

    6.3. Maximum unrated item cells filling percent ($\beta_{max}$)

    According to the proposed protocol, each user $u$ uniformly randomly selects a $\beta_u$ value over the range $[0, \beta_{max}]$. Here, $\beta_{max}$ determines the maximum unrated item cells filling percent, which is closely related to the accuracy and privacy level of the system. We tested the effects of this parameter while varying its value from 3 to 100. Like previous experiments, we kept other parameters constant, where $n$ = 500, BestN = 25, and $\sigma_{max}$ = 2. Due to randomization, we repeated experiments 100 times. After comparing predictions on our scheme with the ones on original data, we computed overall averages of the errors. We displayed the outcomes in Fig. 6 for both data sets.

    The parameter $\beta_{max}$, which is important to qualify the individual privacy level, is inversely correlated with accuracy, as can be seen from Fig. 6. Actually, this fact is expected because larger $\beta_{max}$ values distort the original data more, which causes larger accuracy losses. In other words, the bigger the $\beta_{max}$ values are, the larger the randomness added to the original data is. With increasing randomness, accuracy is expected to become worse. Although maximum accuracy is obtained for a $\beta_{max}$ value of 3, it is better to choose 6 as an optimal selection for privacy concerns because there is a negligible difference of 0.0013 for both data sets between MAE values of the corresponding $\beta_{max}$ values.

    Fig. 4. MAE values for varying BestN values (MLP). [Plot: MAEs against number of neighbors (BestN), without privacy and with privacy.]

    Fig. 5. MAE values for varying BestN values (MLM). [Plot: MAEs against number of neighbors (BestN), without privacy and with privacy.]

    6.4. Maximum standard deviation ($\sigma_{max}$)

    The last parameter to be tested is $\sigma_{max}$. Like $\beta_u$, each user $u$ uniformly randomly chooses a $\sigma_u$ value over the range $[0, \sigma_{max}]$, as explained in the proposed scheme. The larger the $\sigma_{max}$ value, the higher the privacy level (or randomization) and the lower the accuracy. Therefore, it is important to determine an optimum $\sigma_{max}$ value along this trade-off. To do so, we ran different experiments while changing $\sigma_{max}$ from 0.5 to 4. Due to randomness, we repeated our trials 100 times. Similarly, we kept other parameters constant at $n$ = 500, BestN = 25, and $\beta_{max}$ = 6. After comparing predictions on our scheme with the ones on original data, we computed overall averages of the errors. We showed the results in Fig. 7 for both data sets.

    Fig. 6. MAE values for varying $\beta_{max}$ values. [Plot: MAEs against filling percent ($\beta_{max}$), for MLP and MLM.]

    Fig. 7. MAE values for varying $\sigma_{max}$ values. [Plot: MAEs against maximum standard deviation ($\sigma_{max}$), for MLP and MLM.]

    In fact, it is obvious that accuracy and $\sigma_{max}$ are inversely correlated because the perturbation level increases with increasing $\sigma_{max}$, and that makes accuracy worse. As seen from Fig. 7, with increasing $\sigma_{max}$ values, the quality of the referrals gets worse. The bigger such values are, the more randomness we add into the original data. Although accuracy becomes worse with increasing $\sigma_{max}$ values, privacy is enhanced due to augmented randomness. It is crucial to prevent the data holder from estimating $\sigma_u$ values for individual privacy concerns. Therefore, we have chosen an optimal $\sigma_{max}$ value of 2 to still produce dependable and accurate recommendations while not deeply jeopardizing individual privacy.

    6.5. Overall performance

    After determining how accuracy changes with varying parameters, we finally conducted experiments to show the joint effects of such parameters. We used both data sets and set the parameters at their optimum values determined previously. We conducted the experiments for 500 and 1,000 train users for MLM and 500 and 943 train users for MLP. Again, we ran the trials 100 times and estimated RMSE and MAE values by comparing predictions on our scheme with the ones on original data. For both data sets, we displayed the MAE and the RMSE values in Figs. 8 and 9, respectively.

    As expected, the quality of the predictions gets worse due to privacy concerns. When $n$ is increased from 500 to 1000 or 943, accuracy becomes better for both cases. It is inevitable to compromise accuracy to provide individual privacy, but it is also important to provide trustworthy recommendations. Although there are accuracy losses due to our proposed scheme, such losses are small. In other words, it is still possible to offer recommendations with decent accuracy using our scheme. For MLP, the average relative error due to privacy concerns is about 4.7% when $n$ is 500. Similarly, for MLM, it is about 10.9%. Like the MAE values, the RMSE values become worse by about 2.7% and 8.1% for MLP and MLM, respectively, due to our privacy-preserving scheme. Our scheme achieves privacy while sacrificing little on accuracy.

    6.6. Evaluating the SA and the MA methods

    Due to the randomness used to preserve users' privacy, accuracy is expected to decrease. As shown by our experimental results, accuracy becomes slightly worse due to privacy-preserving measures. However, in the case of DWT, we can order items to reduce the information loss caused by neglecting detail coefficients of the transformation, as explained in previous sections, so that we can improve accuracy. To investigate how item ordering affects accuracy, we conducted experiments using both data sets while utilizing our two proposed methods to order items. We followed the same methodology and used the optimum values determined previously. We provided predictions with privacy concerns using the SA approach first. Similarly, we then produced recommendations using our scheme while utilizing the MA method. Again, trials were performed 100 times. After estimating overall averages of the MAE values, we displayed the outcomes in Figs. 10 and 11 for MLP and MLM, respectively.

    Fig. 8. MAE values for overall performance. [Plot: MAEs for MLP500, MLP943, MLM500, and MLM1,000, without privacy and with privacy.]

    Fig. 9. RMSE values for overall performance. [Plot: RMSEs for MLP500, MLP943, MLM500, and MLM1,000, without privacy and with privacy.]

    As can be seen from Figs. 10 and 11, the SA scheme does not have a positive effect on accuracy for MLP; however, it yields a slight improvement for MLM. Unlike the SA method, the MA scheme has a much more positive effect for both data sets. These results support our main motivation to arrange items to reduce information loss. The results presented in Figs. 10 and 11 belong to three-level transformation, as stated previously, and since an arrangement takes place at every step in the MA scheme, accuracy gets better at the end of the transformation. However, in the SA scheme, there is just one arrangement, which takes place in the first transformation step. Therefore, the positive effect of the arrangement is either lost or negated by the time the third transformation is reached. On the other hand, the MA scheme accumulates the enhancement through arrangement and produces more accurate results compared to the privacy-preserving scheme. Although the proposed schemes, especially the MA, improve accuracy, such improvements do not entirely compensate for the losses due to privacy concerns. However, as can be seen from the outcomes, they enhance the quality of our privacy-preserving scheme-based referrals.

    7. Conclusions and future work

    The DWT-based CF scheme is able to overcome the scalability problem and produce recommendations with decent accuracy. However, it fails to protect individual users' privacy. We proposed a scheme to solve the privacy problem that the DWT-based CF scheme faces. Our privacy-preserving scheme hides users' preferences and the items they rated. As expected, the quality of the referrals becomes worse due to our privacy-preserving measures. However, compared to the original scheme, relative errors due to our proposed scheme are about 4.7% for MLP and 10% for MLM. Such losses can be considered acceptable because accuracy and privacy are conflicting goals. Thus, our scheme is able to achieve privacy while sacrificing little on accuracy. Like accuracy and privacy,

    Fig. 10. MAE values for SA and MA schemes (MLP). [Plot: MAEs for n = 500 and n = 943 on the MLP data set, with no arrangement, single arrangement, and multiple arrangement.]

    Fig. 11. MAE values for SA and MA schemes (MLM). [Plot: MAEs for n = 500 and n = 1,000 on the MLM data set, with no arrangement, single arrangement, and multiple arrangement.]

  • Acknowledgement

    References

    A. Bilge, H. Polat / Expert Systems with Applications 39 (2012) 38413854 3853Acilar, M. A., & Arslan, A. (2009). A collaborative ltering method based on articialimmune network. Expert Systems with Applications, 36(4), 83248332.doi:10.1016/j.eswa.2008.10.029.

    Agrawal, D., & Aggarwal, C. C. (2001). On the design and quantication of privacypreserving data mining algorithms. In Paper presented at the proceedings of the20th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems.

    Ahmad, W., & Khokhar, A. (2007). An architecture for privacy preservingcollaborative ltering on Web portals. In Paper presented at the proceedings ofthe third international symposium on information assurance and security.

    Ahn, H. J., Kang, H., & Lee, J. (2010). Selecting a small number of products foreffective user proling in collaborative ltering. Expert Systems withApplications, 37(4), 30553062. doi:10.1016/j.eswa.2009.09.025.

    Al-Shamri, M. Y. H., & Bharadwaj, K. K. (2008). Fuzzy-genetic approach torecommender systems based on a novel hybrid user model. Expert Systemswith Applications, 35(3), 13861399. doi:10.1016/j.eswa.2007.08.016.

    Bogdanova, G., & Georgieva, T. (2008). Using error-correcting dependencies forcollaborative ltering. Data & Knowledge Engineering, 66(3), 402413.doi:10.1016/j.datak.2008.04.008.

    Bum Pan, S., & Park, R.-H. (2002). VLSI architectures of the 1-D and 2-D discretewavelet transforms for JPEG 2000. Signal Processing, 82(7), 981992,.doi:10.1016/S0165-1684(02)00195-0.

    Canny, J. (2002a). Collaborative ltering with privacy. In Paper presented at IEEEsymposium on the security and privacy, proceedings.

    Canny, J. (2002). Collaborative ltering with privacy via factor analysis. In Paperpresented at the proceedings of the 25th annual international ACM SIGIR conferenceon research and development in information retrieval.

    Chen, G., Wang, F., & Zhang, C. (2009). Collaborative ltering using orthogonalnonnegative matrix tri-factorization. Information Processing & Management,45(3), 368379. doi:10.1016/j.ipm.2008.12.004.

    Cheng, F.-H., & Chen, Y.-L. (2006). Real time multiple objects tracking andidentication based on discrete wavelet transform. Pattern Recognition, 39(6),11261139.

    Cranor, L. F., Reagle, J., & Ackerman, M. S. (1999). Beyond concern: Understanding netusers attitudes about online privacy.

    Feng, Z., & Hui-You, C. (2005). A collaborative ltering algorithm embedded BPnetwork to ameliorate sparsity issue. In Paper presented at proceedings of 2005This work is supported by Grant 108E221 from TUBITAK.performance is also among the goals that should be accomplished.Applying privacy-preserving methods might introduce supplemen-tary costs. Online performance is extremely critical for the successof CF systems. We showed that although our scheme introducesadditional costs, as expected, they are negligible and they do notimmensely affect online performance.

    In DWT-based CF schemes, item ordering is an important issue.DWT can be applied to naturally ordered items. However, wehypothesized that items can be ordered based on some metricsin such way that accuracy might be enhanced. For this purpose,we presented two item ordering schemes. Although one of themdid not work well for one of the data sets, we demonstrated thatthe MA approach denitely improved accuracy. Our goal was totry to compensate the losses due to our privacy-preserving mea-sures by using item ordering methods. Our experimental resultsdemonstrated that the MA approach is able to reimburse some ofthe losses. Since item ordering is performed off-line, our proposedmethods do not cause any extra online costs.

    To more improve accuracy by utilizing item ordering methods,we are planning to employ different similarity metrics to deter-mine similar items. Finding the best similar items is crucial foritem ordering. Since data are masked, such measures should allowthe server estimating the similarities from perturbed data with de-cent accuracy. Moreover, we will scrutinize how to order itemsbased on similarity weights. Different approaches can be utilizedto order items according to estimated weights. We will also studyhow DWT can be applied to user-item matrices including binaryratings.international conference on the machine learning and cybernetics.Feng, Z., & Hui-you, C. (2006). A collaborative ltering algorithm employing geneticclustering to ameliorate the scalability issue. In IEEE international conference onpaper presented at the e-Business engineering, ICEBE 06.

    Han, P., Xie, B., Yang, F., & Shen, R. (2004). A scalable P2P recommender system based on distributed collaborative filtering. Expert Systems with Applications, 27(2), 203-210. doi:10.1016/j.eswa.2004.01.003.

    Jensen, C., Potts, C., & Jensen, C. (2005). Privacy practices of Internet users: Self-reports versus observed behavior. International Journal of Human-Computer Studies, 63(1-2), 203-227.

    Jeong, B., Lee, J., & Cho, H. (2010). Improving memory-based collaborative filtering via similarity updating and prediction modulation. Information Sciences, 180(5), 602-612. doi:10.1016/j.ins.2009.10.016.

    Kaleli, C., & Polat, H. (2007a). Providing Naïve Bayesian classifier-based private recommendations on partitioned data. In Paper presented at the proceedings of the 11th European conference on principles and practice of knowledge discovery in databases.

    Kaleli, C., & Polat, H. (2007b). Providing private recommendations using Naïve Bayesian classifier (pp. 168-173).

    Kim, D., & Yum, B.-J. (2005). Collaborative filtering based on iterative principal component analysis. Expert Systems with Applications, 28(4), 823-830. doi:10.1016/j.eswa.2004.12.037.

    Kohrs, A., & Merialdo, B. (2001). Creating user-adapted Websites by the use of collaborative filtering. Interacting with Computers, 13(6), 695-716. doi:10.1016/S0953-5438(01)00038-8.

    Korürek, M., & Nizam, A. (2010). Clustering MIT-BIH arrhythmias with ant colony optimization using time domain and PCA compressed wavelet coefficients. Digital Signal Processing, 20(4), 1050-1060. doi:10.1016/j.dsp.2009.10.019.

    Kumari, R. S. S., & Sadasivam, V. (2007). A novel algorithm for wavelet based ECG signal coding. Computers & Electrical Engineering, 33(3), 186-194.

    Liu, R.-R., Jia, C.-X., Zhou, T., Sun, D., & Wang, B.-H. (2009). Personal recommendation via modified collaborative filtering. Physica A: Statistical Mechanics and its Applications, 388(4), 462-468. doi:10.1016/j.physa.2008.10.010.

    Ma, H., King, I., & Lyu, M. R. (2007). Effective missing data prediction for collaborative filtering. In Paper presented at the proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval.

    McLaughlin, R. (2003). Big box retail in the new economy: Place-making in the era of the electronic Milkman.

    Merve Acilar, A., & Arslan, A. (2009). A collaborative filtering method based on artificial immune network. Expert Systems with Applications, 36(4), 8324-8332. doi:10.1016/j.eswa.2008.10.029.

    Plas, R. V. D., Moor, B. D., & Waelkens, E. (2008). Discrete wavelet transform-based multivariate exploration of tissue via imaging mass spectrometry. In Paper presented at the proceedings of the 2008 ACM symposium on applied computing.

    Polat, H., & Du, W. (2003). Privacy-preserving collaborative filtering using randomized perturbation techniques. In Paper presented at the proceedings of the third IEEE international conference on data mining.

    Polat, H., & Du, W. (2005). SVD-based collaborative filtering with privacy. In Paper presented at the proceedings of the 2005 ACM symposium on applied computing.

    Polat, H., & Du, W. (2006). Achieving private recommendations using randomized response techniques (pp. 637-646).

    Roh, T. H., Oh, K. J., & Han, I. (2003). The collaborative filtering recommendation based on SOM cluster-indexing CBR. Expert Systems with Applications, 25(3), 413-423. doi:10.1016/S0957-4174(03)00067-8.

    Russell, S., & Yoon, V. (2008). Applications of wavelet data reduction in a recommender system. Expert Systems with Applications, 34(4), 2316-2325. doi:10.1016/j.eswa.2007.03.009.

    Sarwar, B., Karypis, G., Konstan, J., & Reidl, J. (2001). Item-based collaborative filtering recommendation algorithms. In Paper presented at the proceedings of the 10th international conference on World Wide Web.

    Symeonidis, P., Nanopoulos, A., Papadopoulos, A. N., & Manolopoulos, Y. (2008). Collaborative recommender systems: Combining effectiveness and efficiency. Expert Systems with Applications, 34(4), 2995-3013. doi:10.1016/j.eswa.2007.05.013.

    Vozalis, M., Markos, A., & Margaritis, K. (2009). On the performance of SVD-based algorithms for collaborative filtering. In Paper presented at the proceedings of the 2009 fourth Balkan conference in informatics.

    Vozalis, M. G., & Margaritis, K. G. (2007). Using SVD and demographic data for the enhancement of generalized collaborative filtering. Information Sciences, 177(15), 3017-3037. doi:10.1016/j.ins.2007.02.036.

    Xue, G.-R., Lin, C., Yang, Q., Xi, W., Zeng, H.-J., Yu, Y., et al. (2005). Scalable collaborative filtering using cluster-based smoothing. In Paper presented at the proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval.

    Yakut, I., & Polat, H. (2007). Privacy-preserving eigenstate-based collaborative filtering. In Paper presented at the proceedings of the security 2nd international conference on advances in information and computer security.

    Yildirim, H., & Krishnamoorthy, M. S. (2008). A random walk method for alleviating the sparsity problem in collaborative filtering. In Paper presented at the proceedings of the 2008 ACM conference on recommender systems.

    Yu, G., & Kamarthi, S. V. (2010). A cluster-based wavelet feature extraction method and its application. Engineering Applications of Artificial Intelligence, 23(2), 196-202. doi:10.1016/j.engappai.2009.11.004.

    Zhang, L., Xiao, B., & Guo, J. (2008). A hybrid approach to collaborative filtering for overcoming data sparsity. In Paper presented at 9th international conference on signal processing, ICSP 2008.

    Zhang, S., Wang, W., Ford, J., Makedon, F., & Pearlman, J. (2005). Using singular value decomposition approximation for collaborative filtering. In Paper presented at the proceedings of the seventh IEEE international conference on E-commerce technology.

    Zhao, X., & Ye, B. (2009). Similarity of signal processing effect between Hankel matrix-based SVD and wavelet transform and its mechanism analysis. Mechanical Systems and Signal Processing, 23(4), 1062-1075. doi:10.1016/j.ymssp.2008.09.009.

    Ziqiang, W., & Boqin, F. (2004). Collaborative filtering algorithm based on mutual information (pp. 405-415).

    A. Bilge, H. Polat / Expert Systems with Applications 39 (2012) 3841-3854
