
[IEEE 2013 IEEE 22nd International Workshop On Enabling Technologies: Infrastructure For Collaborative Enterprises (WETICE) - Hammamet, Tunisia (2013.06.17-2013.06.20)] 2013 Workshops


Privacy-Preserving Collaborative Filtering on Overlapped Ratings

Burak Memis Dumlupinar University

Dept. of Computer Engineering Kutahya, Turkey

[email protected]

Ibrahim Yakut Anadolu University

Dept. of Computer Engineering Eskisehir, Turkey

[email protected]

Abstract- Several privacy-preserving collaborative filtering (PPCF) solutions enable e-commerce parties to collaborate on partitioned data in order to improve the prediction quality of their recommendation services. It is quite likely that both parties hold ratings for the same users and items simultaneously; however, existing PPCF schemes have not explored such overlaps. Since rating values and rated items are confidential, overlapping ratings make privacy preservation more challenging. This study examines how to estimate predictions privately from partitioned data with overlapped entries between two e-commerce companies, and we propose novel PPCF schemes for this setting.

Keywords- Collaborative Filtering, Data Scarcity, Overlapped Ratings, Privacy

I. INTRODUCTION

Recommender systems have recently become important and popular in e-business applications [1]. Such systems not only facilitate the decision process of users who have limited time to spend on the web, but also inform internet users about music, films and books that may suit their tastes. Recommender systems exploit either content-based information about an item or ratings on items collected from users. The former technique is called content-based filtering and the latter collaborative filtering. According to [2], collaborative filtering (CF) performs significantly better than content-based filtering because it does not require content analysis of complex items and can recommend items based on taste information.

User ratings on products are the crucial information that sustains CF recommendation services. However, e-commerce companies may suffer from a scarcity of ratings, which prevents them from offering quality CF services. One solution to this problem is to collaborate with another data-holding company to offer better recommendation services. However, rating data can be subject to privacy risks [3], and e-commerce companies are responsible for the confidentiality of the data they hold [4]. To encourage such parties to cooperate, privacy guarantees need to be provided. For this reason, a range of privacy-preserving collaborative filtering (PPCF) schemes have been proposed for partitioned data [5, 6]. Through the privacy-preserving contribution of additional data, the data scarcity problem can be tackled and companies can provide recommendations of satisfactory quality and quantity.

In this study, we examine how two parties holding partitioned data with overlapped ratings can improve their recommendation services while ensuring corporate data privacy. The challenge is to propose a PPCF solution that increases prediction quality while guaranteeing the confidentiality of the data held by each party. Our contributions are the notion of overlapped ratings in PPCF on partitioned data and a PPCF solution for two parties with overlapped rating data. The paper is organized as follows. In the next section, we position our problem with respect to the state of the art, and we give related definitions and preliminaries in Section III. After presenting and theoretically analyzing our proposal in Section IV, we present the experimental setup and results in the following section. Finally, we conclude the study and give future research directions.

II. LITERATURE REVIEW

Cooperative data mining over partitioned data is widely proposed as a remedy for the data scarcity problem. In this context, two parties can end up with three kinds of data configurations: horizontal, vertical, or arbitrary. Kantarcioglu and Clifton [7] address the secure mining of association rules over horizontally partitioned data, in which the parties hold disjoint sets of objects for the same set of attributes. Vaidya et al. [8] introduce a generalized privacy-preserving variant of the ID3 decision tree algorithm for vertically partitioned data, in which the parties hold records of the same set of objects for disjoint sets of attributes. Jagannathan and Wright [9] introduce a privacy-preserving k-means clustering algorithm on arbitrarily partitioned data, which comprises arbitrary entries for the same set of objects and attributes at either party. Our partitioning can be considered arbitrarily partitioned data; however, in [9], if some data values are held by both parties, such values should be processed by only one of the parties. Since their privacy scope does not cover which records are missing or entered, such overlapping situations can be handled easily. Our notion of privacy, in contrast, covers which items are rated by which users, and such overlaps increase the complexity of the PPCF problem.

PPCF challenges have been examined from different directions. Some authors [10, 11] consider the collection of user data on a centralized server to be a privacy hazard and propose concordance metrics [10] and trusted coalitions of server architectures [11] in distributed computation settings. Rather than being concerned with the existence of a central server, we focus on how two parties can cooperate properly to offer CF

2013 Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises

978-0-7695-5002-2/13 $26.00 © 2013 IEEE

DOI 10.1109/WETICE.2013.55


services over the data on their own servers. In this context, there is a range of PPCF proposals on how to improve recommender services using data partitioned between two parties. Polat and Du [5] offer a top-N recommender solution operating on horizontally partitioned data. Kaleli and Polat [12] examine binary predictions on users' like and dislike values between two parties via a naïve Bayesian classifier (NBC) in a privacy-preserving manner. They present separate solutions for horizontally and vertically partitioned data. Similarly considering these two partitioning cases, Yakut and Polat [13] propose SVD-based PPCF schemes. Arbitrary partitioning has also been investigated in the context of PPCF. In another work [14], they examine how two parties can provide recommendations using item-based CF techniques on arbitrarily partitioned data and propose schemes for this variant. The same authors [6] also offer a privacy-preserving NBC-based CF scheme producing binary referrals on arbitrarily partitioned data. Our motivation is the same as in the aforementioned studies; however, this study addresses overlapped ratings. In the previous PPCF works, no ratings overlap, whereas we scrutinize ratings that overlap between two parties.

III. PRELIMINARIES

A. Collaborative Filtering

The main idea of CF systems is to infer customers' preferences from the preferences of a set of similar users, i.e. neighbors. In this way, CF systems produce a prediction paq for an active user (a) on a target item (q) using an n×m user-item rating matrix, where n and m are the numbers of users and items, respectively. A typical CF process has three main steps: similarity computation, neighborhood determination, and prediction generation based on the similarity-weighted average of the neighbors' ratings on q. First, the similarity between a and a train user u can be computed using the Pearson correlation coefficient as

$$w_{au} = \frac{\sum_{j \in C} (r_{aj} - \bar{r}_a)(r_{uj} - \bar{r}_u)}{\sigma_a \, \sigma_u}, \tag{1}$$

where C, wau, ruj, $\bar{r}_u$ and $\sigma_u$ represent the set of commonly rated items, the similarity between a and train user u, the rating given by u on item j, user u's mean rating and user u's standard deviation, respectively [2]. After the similarity between a and each train user u is calculated, a's neighborhood is formed from the most similar users. The final prediction paq then equals the similarity-weighted average of the ratings given by the neighbors for q:

$$p_{aq} = \bar{r}_a + \frac{\sum_{u \in N} w_{au}\,(r_{uq} - \bar{r}_u)}{\sum_{u \in N} w_{au}}, \tag{2}$$

where N stands for a’s neighbors [2].
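Eqs. 1 and 2 can be sketched in a few lines. This is a minimal single-party illustration, not the authors' implementation: the function names are hypothetical, profiles are plain dictionaries mapping item to rating, and $\sigma$ is taken as the square root of the summed squared deviations over all of a user's ratings, which is an assumption made here so that the similarity stays within [-1, 1].

```python
import math

def mean(ratings):
    """Mean of one user's ratings (dict: item -> rating)."""
    return sum(ratings.values()) / len(ratings)

def sigma(ratings):
    # Assumption: root of summed squared deviations (unnormalized std).
    m = mean(ratings)
    return math.sqrt(sum((r - m) ** 2 for r in ratings.values()))

def similarity(ra, ru):
    """Eq. 1: sum over commonly rated items of mean-centered products."""
    common = ra.keys() & ru.keys()
    ma, mu = mean(ra), mean(ru)
    num = sum((ra[j] - ma) * (ru[j] - mu) for j in common)
    den = sigma(ra) * sigma(ru)
    return num / den if den else 0.0

def predict(ra, neighbors, q):
    """Eq. 2: a's mean plus the similarity-weighted average deviation on q."""
    num = den = 0.0
    for ru in neighbors:
        if q not in ru:
            continue  # neighbors who did not rate q contribute nothing
        w = similarity(ra, ru)
        num += w * (ru[q] - mean(ru))
        den += w
    return mean(ra) + (num / den if den else 0.0)

# Hypothetical toy profiles.
active = {1: 5, 2: 3, 3: 4}
u1 = {1: 4, 2: 2, 4: 5}
u2 = {1: 5, 2: 2, 3: 4, 4: 3}
p = predict(active, [u1, u2], q=4)
```

In this sketch every given profile is treated as a neighbor; the neighborhood selection step described above would filter the list before `predict` is called.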

B. Overlapped Ratings

Two parties, say A and B, want to provide CF services on partitioned data with overlapped ratings. They have similar customer and item portfolios. As shown in Fig. 1, there are three subsets of ratings according to who holds them: RA, RB and R. RA and RB hold the ratings that belong only to A and only to B, respectively, while R includes the overlapped ratings given by the same user for the same item to both parties. If R is empty, there is no rating overlap and the partitioning reduces to arbitrarily partitioned data (APD) as examined in [14]. However, as discussed in Section II, such overlaps make our problem more challenging than APD in terms of both prediction quality and privacy preservation. Fig. 1 also illustrates the sparsity of CF rating data, in which the many unrated items are shown as empty cells. For the sake of simplicity, we assume that overlapped ratings are consistent, i.e. a user has given the same rating value to the same item in both parties' data.

Figure 1. Partitioned data with sample overlapped ratings

C. Privacy and the Problem

In the context of PPCF [6], the private information consists of each rating value and which items are rated by which user. Privacy preservation therefore means not allowing any party to learn the private information in the other party's data. The proposal also must not allow any exchange of intermediate values that divulges private information, since the collaborating parties are semi-honest: they follow the predefined procedure while trying to infer as much as possible from the available aggregates. The problem is how to improve the quality of CF recommendation services on partitioned data with overlapped ratings while each party is assured that the privacy of its own data is preserved.

We solve this problem with the aid of default votes and homomorphic cryptosystems (HCs). Built on public-key cryptosystem infrastructure, the Paillier HC [15] can add two numbers in ciphertext form and obtain the encrypted version of the actual sum. Suppose that a and b are two numbers and ξK is the encryption function with key K. Then the ciphertexts of the numbers are ξK(a) and ξK(b), and their product is ξK(a) × ξK(b) = ξK(a + b). Analogously, multiplication by a plaintext can be performed as ξK(a)^b = ξK(ab). The Paillier HC also has a self-blinding property, which permits a ciphertext to be publicly modified without affecting the plaintext by multiplying it by R^N, where R is a random integer and N is the modulus of the underlying public-key cryptosystem.
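The three Paillier properties used here (additive homomorphism, multiplication by a plaintext, and self-blinding) can be checked with a toy implementation. The sketch below is for illustration only: the primes are tiny, the generator is fixed to n + 1, all function names are hypothetical, and none of this is suitable for real use.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy key generation with g = n + 1; real keys need large random primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)

def random_unit(n):
    """Random r in [2, n) that is invertible modulo n."""
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            return r

def encrypt(n, m):
    n2 = n * n
    # (n + 1)^m = 1 + m*n (mod n^2), times a random n-th power
    return (1 + (m % n) * n) * pow(random_unit(n), n, n2) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

def add(n, c1, c2):
    """Ciphertext product encrypts the sum of the plaintexts."""
    return c1 * c2 % (n * n)

def mul_plain(n, c, b):
    """Ciphertext raised to b encrypts b times the plaintext."""
    return pow(c, b, n * n)

def blind(n, c):
    """Self-blinding: changes the ciphertext, keeps the plaintext."""
    return c * pow(random_unit(n), n, n * n) % (n * n)

n, priv = keygen()
ca, cb = encrypt(n, 7), encrypt(n, 35)
c_sum = add(n, ca, cb)
c_prod = mul_plain(n, ca, 6)
c_blind = blind(n, ca)
```

Decrypting `c_sum` and `c_prod` recovers 7 + 35 and 7 × 6, while `c_blind` differs from `ca` as a ciphertext but decrypts to the same value.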

IV. PRIVACY-PRESERVING COLLABORATIVE FILTERING ON OVERLAPPED RATINGS

We give a two-fold solution framework, shown in Fig. 2, in which the building blocks are drawn from the view of any party j while k stands for the other party. In our first solution, the plain scheme, we handle the problem without eliminating overlaps, while the alternative scheme determines the overlaps privately and then removes them. In this section, we present our proposals in detail.

Figure 2. Our two-fold solution framework

A. Preprocessing

According to Eq. 1, each party needs to normalize its own data. To perform this normalization, each party needs the user means. To compute the denominator of the same equation, they also need the standard deviation of each user. Mean and standard deviation are algebraic measures, which are composed of distributive measures, and distributive measures can easily be calculated in a distributed manner. For example, the arithmetic mean equals the sum of the numbers in an array divided by the number of elements. If the array is partitioned between two parties, each party can obtain the mean of its elements by exchanging partial sums and partial sizes. In our setting, however, direct exchange of such statistical measures may cause privacy breaches, especially when only a small number of ratings are available from a user. To ensure privacy, we offer a randomized default vote filling procedure, in which the default votes (vds) can be the row mean, column mean, or overall mean of the available ratings of party j. After the parties agree on a maximum level of filling, expressed as a percentage of density, party j can enhance its own data with vds as follows:

1. Randomly or selectively choose party j's own level of filling from the range between 0 and the agreed maximum.

2. Randomly select unrated cells whose number equals that percentage of the number of ratings available to party j.

3. Fill the selected cells with vds.

After filling their own data, the parties can exchange partial sum and count values and estimate the user means. They then normalize their data using the deviation-from-user-mean approach and estimate the user standard deviations in the same way as the means. After preprocessing, each party ends up with estimates of each user's mean and standard deviation.
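The filling procedure and the exchange of partial sums can be sketched as follows. This is a minimal illustration under stated assumptions: each party's data is a dictionary keyed by (user, item), the overall mean is used as the default vote (one of the three options named above), and all function names and toy values are hypothetical.

```python
import random

def fill_with_default_votes(ratings, n_users, n_items, max_fill_pct, rng):
    """ratings: {(user, item): rating} held by one party; returns a filled copy."""
    default_vote = sum(ratings.values()) / len(ratings)  # overall mean as vd
    fill_pct = rng.uniform(0, max_fill_pct)               # step 1
    unrated = [(u, i) for u in range(n_users) for i in range(n_items)
               if (u, i) not in ratings]
    # step 2: number of cells is fill_pct percent of the available ratings
    k = min(len(unrated), round(fill_pct / 100 * len(ratings)))
    filled = dict(ratings)
    for cell in rng.sample(unrated, k):                   # step 3
        filled[cell] = default_vote
    return filled

def partial_sum_count(data, user):
    """Partial sum and count a party would send for one user."""
    values = [r for (u, _), r in data.items() if u == user]
    return sum(values), len(values)

def estimate_user_mean(part_a, part_b):
    """Combine the two parties' partial sums and counts."""
    total, count = part_a[0] + part_b[0], part_a[1] + part_b[1]
    return total / count if count else 0.0

rng = random.Random(0)
party_a = {(0, 0): 5, (0, 1): 3, (1, 2): 4}
party_b = {(0, 2): 4, (1, 0): 2, (1, 1): 2}
filled_a = fill_with_default_votes(party_a, 2, 3, 100, rng)
filled_b = fill_with_default_votes(party_b, 2, 3, 100, rng)
mean_user0 = estimate_user_mean(partial_sum_count(filled_a, 0),
                                partial_sum_count(filled_b, 0))
```

Because the exchanged sums and counts include default votes, `mean_user0` is an estimate rather than the exact mean of user 0's real ratings, which is the intended privacy trade-off.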

B. Similarity Computation

To compute similarities, two complete user profiles are needed. However, such profiles are arbitrarily distributed between the two parties. With two parties and two users, the similarity between users a and u can be written as

$$w_{au} = XY = X_A Y_A + X_A Y_B + X_B Y_A + X_B Y_B, \tag{3}$$

where X and Y represent the normalized rating profiles of a and u, respectively, and Xj and Yj stand for the part of each profile available at party j. Overlaps affect the accuracy of the recommender; however, we hypothesize that reasonable results can still be obtained despite overlapped ratings. For the plain approach, we give a private similarity computation protocol (PrivateSims) that does not consider overlaps. We also show how to tackle overlaps while preserving privacy in the following subsections.

Private Similarity Computation Protocol (PrivateSims)

For each user u with a:
1 Each party assigns zero to all unrated cells.
2 Each party j computes XjYj.
3 For train users u from 1 to n/2:
  3.1 A encrypts each element i of XA and YA with its public key KA.
  3.2 A sends all $\xi_{K_A}(X_{Ai})$ and $\xi_{K_A}(Y_{Ai})$ to B.
  3.3 B computes all $\xi_{K_A}(X_{Ai})^{Y_{Bi}}$, then finds $\xi_{K_A}(X_A Y_B)$.
  3.4 B computes all $\xi_{K_A}(Y_{Ai})^{X_{Bi}}$, then finds $\xi_{K_A}(X_B Y_A)$.
  3.5 B encrypts XBYB with KA.
  3.6 Using Paillier's addition, B finds $\xi_{K_A}(X_A Y_B + X_B Y_A + X_B Y_B)$.
  3.7 B sends the resulting ciphertext to A.
  3.8 A decrypts it, adds XAYA, divides by the proper $\sigma_a \sigma_u$ and obtains wau.
4 For the remaining train users:
  4.1 Switching roles, repeat steps 3.1-3.8.
5 Finally, each party has n/2 of the n similarities.

The privacy mechanism of the PrivateSims protocol is based on the Paillier HC. In the initial step, we set unrated cells to zero because we intend to exploit the absorbing-element property of zero in multiplication. In step 2, each party performs a partial similarity calculation over only its available ratings. In steps 3-4, the parties privately compute the components of wau, and each ends up with half of the total similarity values between a and the train users u. Note also that we exploit the self-blinding property of the Paillier HC for all encryptions in our scheme so that ciphertexts of the same plaintext cannot be matched with each other.
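Steps 3.1-3.8 can be sketched for a single pair of users with a toy Paillier instance. Several things here are assumptions rather than part of the protocol description: normalized ratings are turned into integers by a fixed-point scale, negative values are represented modulo N and decoded as signed numbers, both roles run in one process, and the primes and function names are illustrative only.

```python
import math
import random

P, Q = 2**31 - 1, 2**61 - 1            # toy primes, far too small for real use
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)                    # valid because g = N + 1

def enc(m):
    """Paillier encryption under A's key; m may be negative."""
    while True:
        r = random.randrange(2, N)
        if math.gcd(r, N) == 1:
            break
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def dec(c):
    """Decryption with signed decoding of the plaintext."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m

def encrypted_dot(cipher_vec, plain_vec):
    """B's side of steps 3.3/3.4: product of ciphertext powers."""
    acc = 1  # 1 decrypts to 0
    for c, y in zip(cipher_vec, plain_vec):
        acc = acc * pow(c, y % N, N2) % N2
    return acc

def private_sim(xa, ya, xb, yb, sigma_a, sigma_u, scale):
    enc_xa = [enc(v) for v in xa]                 # steps 3.1-3.2 (A)
    enc_ya = [enc(v) for v in ya]
    c1 = encrypted_dot(enc_xa, yb)                # step 3.3 (B)
    c2 = encrypted_dot(enc_ya, xb)                # step 3.4 (B)
    c3 = enc(sum(x * y for x, y in zip(xb, yb)))  # step 3.5 (B)
    reply = c1 * c2 % N2 * c3 % N2                # steps 3.6-3.7 (B)
    total = dec(reply) + sum(x * y for x, y in zip(xa, ya))  # step 3.8 (A)
    return total / (scale * scale * sigma_a * sigma_u)

# Hypothetical normalized profiles scaled by 100, zeros for unrated cells.
SCALE = 100
xa, xb = [100, 0, -50, 0], [0, -100, 0, 50]
ya, yb = [0, 150, 0, -50], [-150, 0, 50, 0]
sa, su = math.sqrt(2.5), math.sqrt(5.0)
w = private_sim(xa, ya, xb, yb, sa, su, SCALE)
```

Because the four partial products expand (XA + XB)(YA + YB), a cell present at both parties contributes more than once, which is one way to see why overlaps affect accuracy in the plain scheme.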

C. Prediction Computation

Now, we need to compute Eq. 2. Considering that similarities and ratings are distributed between the parties, Eq. 2 can be rearranged as

$$p_{aq} = \bar{r}_a + \frac{\sum_{u \in N} (w_{auA}\,\tilde{r}_{uqA} + w_{auA}\,\tilde{r}_{uqB} + w_{auB}\,\tilde{r}_{uqA} + w_{auB}\,\tilde{r}_{uqB})}{\sum_{u \in N} (w_{auA} + w_{auB})}, \tag{4}$$

where $w_{auj}$ and $\tilde{r}_{uqj}$ stand for the similarity value and the normalized rating of u on q held by party j, respectively. We propose a private prediction computation protocol (PrivatePreds) for similarities and ratings distributed in this way. The protocol is presented for the case in which A is the master party queried for paq. If the master party is B, the parties switch roles and proceed in the same way. In step 1, we determine a's neighbors based on a threshold and select as neighbors the users whose similarities are greater than the threshold. In step 3, each party generates a binary clone rating vector whose entry is one if q is rated by u and zero otherwise. Since one is the identity element for multiplication, we use the binary clones to add up the proper similarity values in the denominator. Steps 4-8 compute the numerator, while step 9 performs the corresponding computations for the denominator. In step 11, $(w_{auA})_j$ stands for the similarity values available at A that are used in the numerator calculation by party j. At the end of PrivatePreds, the master party returns the prediction paq to a.

Private Prediction Computation Protocol (PrivatePreds)

1 Each party assigns zero to all of its own similarity values below the threshold.
2 Each party assigns zero to all unrated cells for q.
3 Each party generates a binary clone rating vector $(c_{uqj})$.
4 A encrypts each element i of $w_{auA}$, $\tilde{r}_{uqA}$ and $c_{uqj}$ with KA.
5 A sends all $\xi_{K_A}(w_{auA})$ and $\xi_{K_A}(\tilde{r}_{uqA})$ values to B.
6 B computes $\xi_{K_A}(w_{auA})^{\tilde{r}_{uqB}}$, then obtains $\xi_{K_A}(w_{auA}\,\tilde{r}_{uqB})$.
7 B computes $\xi_{K_A}(\tilde{r}_{uqA})^{w_{auB}}$, then obtains $\xi_{K_A}(w_{auB}\,\tilde{r}_{uqA})$.
8 B computes $w_{auB}\,\tilde{r}_{uqB}$ and encrypts it with KA.
9 B repeats steps 6-8, replacing $\tilde{r}_{uqj}$ with the proper $c_{uqj}$.
10 B adds up and finds $\xi_{K_A}(w_{auA}\,\tilde{r}_{uqB} + w_{auB}\,\tilde{r}_{uqA} + w_{auB}\,\tilde{r}_{uqB})$ and $\xi_{K_A}((w_{auA})_B + w_{auB})$.
11 A decrypts them, adds $w_{auA}\,\tilde{r}_{uqA}$ to the former and $(w_{auA})_A$ to the latter.
12 A divides the numerator by the denominator, adds a's mean, and finds the prediction paq.
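The arithmetic that PrivatePreds carries out under encryption can be followed in plaintext. The sketch below deliberately omits the encryption layer and only shows how the thresholding, zero filling and binary clones of steps 1-3 feed the split numerator and denominator; the function name, the list-based layout and the toy values are assumptions made for illustration.

```python
def split_prediction(mean_a, w_a, w_b, r_a, r_b, threshold):
    """Plaintext evaluation of the split numerator and denominator.

    w_j[u]: similarity held by party j for user u (0.0 if held by the other party).
    r_j[u]: normalized rating of u on q at party j, or None if unrated there.
    """
    num = den = 0.0
    for wa, wb, ra, rb in zip(w_a, w_b, r_a, r_b):
        # Step 1: similarities below the threshold are set to zero.
        wa = wa if wa >= threshold else 0.0
        wb = wb if wb >= threshold else 0.0
        # Step 3: binary clones mark whether q is rated at each party.
        ca, cb = int(ra is not None), int(rb is not None)
        # Step 2: unrated cells for q are set to zero.
        ra, rb = ra or 0.0, rb or 0.0
        # Steps 4-8 and 11: the four numerator terms of the rearranged equation.
        num += wa * ra + wa * rb + wb * ra + wb * rb
        # Step 9: same products with ratings replaced by binary clones.
        den += wa * ca + wa * cb + wb * ca + wb * cb
    return mean_a + (num / den if den else 0.0)

# Hypothetical data for three train users.
w_a = [0.8, 0.0, 0.1]
w_b = [0.0, 0.6, 0.0]
r_a = [1.0, None, 0.5]
r_b = [None, -0.5, None]
p_aq = split_prediction(3.5, w_a, w_b, r_a, r_b, threshold=0.2)
```

With these values only the first two users survive the threshold, so the prediction is 3.5 plus (0.8 × 1.0 + 0.6 × (−0.5)) divided by (0.8 + 0.6).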

D. Removing Overlaps

As seen in Fig. 2, removing overlaps involves two processes: getting rid of the initially filled votes (smoothing) and privately determining and removing overlaps (privately match & remove). In the first step, each party deletes its vds after preprocessing, so that such vds do not cause additional overlaps. In the second step, since the values of overlapped ratings for the same user-item pair are equal, we consider that a party's knowing that such an overlapped item is rated does not harm privacy. Hence, within our privacy scope, there is no problem with the parties learning which ratings are overlapped. The problem then comes down to privately determining which ratings overlap. It can be viewed as two parties, each holding a set, wanting to find the items common to both. This problem has attracted considerable attention in privacy-preserving data mining, and several privacy-preserving set intersection protocols have been proposed for parties holding confidential data. In this context, Freedman et al. [16] present efficient schemes, and to find overlaps we apply one of them, namely private matching for semi-honest parties (PM-Semi-Honest). PM-Semi-Honest is a two-party protocol between a chooser and a sender, who hold sets of possibly different sizes drawn from the same domain. At the end of the protocol, the chooser learns which of its inputs are shared by both parties.

We propose a protocol for privately matching and removing overlaps (Privately Match & Remove) in order to tackle overlaps. Initially, each party j finds the indices of its rated cells and computes the cutting index point (pci), where pci = (n·m)/2. At the end, each party j knows approximately half of the total overlaps and deletes the ratings it holds at the indices corresponding to those overlaps. After removing overlaps, the parties continue with PrivateSims. This solution is called the ultimate scheme (US). If the parties do not need or prefer not to remove overlaps, the plain scheme (PS), which does not involve the overlap removal process, can be applied.

Privately Matching and Removing Overlaps Protocol (Privately Match & Remove)

1 Each party j finds the indices of its rated cells and computes pci.
2 For rating indices from the first to pci:
  2.1 Set A as chooser and B as sender.
  2.2 Apply PM-Semi-Honest.
  2.3 A learns about half of the overlaps and removes the corresponding rating values.
3 For rating indices from pci to the end:
  3.1 Switching the parties' roles in steps 2.1-2.3, B removes the remaining overlaps.
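The bookkeeping of Privately Match & Remove can be sketched without the cryptography. In the sketch below a plain set intersection stands in for PM-Semi-Honest, so it offers no privacy at all and only shows which party deletes which overlapped cell. The row-major cell index and the choice of which side of pci is inclusive are assumptions, and the function name is hypothetical.

```python
def match_and_remove(ratings_a, ratings_b, n, m):
    """ratings_j: {(user, item): rating}; returns both parties' data without overlaps."""
    pci = (n * m) // 2                      # cutting index point

    def index(cell):
        user, item = cell
        return user * m + item              # assumed row-major linear index

    # Stand-in for PM-Semi-Honest: in the protocol the chooser learns this privately.
    overlaps = ratings_a.keys() & ratings_b.keys()

    # Step 2: A is the chooser for indices below pci and removes its copies.
    a_out = {c: r for c, r in ratings_a.items()
             if not (c in overlaps and index(c) < pci)}
    # Step 3: roles are switched for the remaining indices, so B removes its copies.
    b_out = {c: r for c, r in ratings_b.items()
             if not (c in overlaps and index(c) >= pci)}
    return a_out, b_out

# Hypothetical 2 x 3 example with two overlapped cells.
ratings_a = {(0, 0): 5, (0, 1): 3, (1, 1): 4}
ratings_b = {(0, 0): 5, (1, 1): 4, (1, 2): 2}
a_out, b_out = match_and_remove(ratings_a, ratings_b, n=2, m=3)
```

After the call each overlapped rating survives at exactly one party, so the union of the two data sets is unchanged while their intersection is empty.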

V. ANALYSIS AND SIMULATION RESULTS

A. Analysis of the Schemes

We propose two PPCF solutions for two parties that end up with overlapped ratings. First of all, we claim that our schemes meet the privacy requirements given in Section III-C. Through homomorphic encryption and randomized filling with default votes, the confidentiality of rated items and rating values is ensured. In the preprocessing step, we exploit vds to avoid sharing the actual sum, count, and sum of squares. Such vds not only provide protection when a row (user) has only a few ratings but also enhance the privacy of the PPCF schemes, especially in PS. While PS promises more privacy than US owing to the presence of vds, US frees PPCF from the accuracy effects of overlaps and randomized filling. Similarly, among the types of vds, column mean can be considered the most privacy-enhancing choice, since the overall mean is the same for all of party j's data, and the available local user mean (sum/count) is shared with the other party when row mean is selected as vd.


Paillier [15] proved that his homomorphic cryptosystem achieves semantic security, and the privacy preservation of our protocols PrivateSims and PrivatePreds rests directly on this result. The privacy of Privately Match & Remove is provided by Freedman et al.'s PM-Semi-Honest [16], which is also based on Paillier's scheme or subsequent versions of it. The self-blinding property matters most for unrated cells: there are numerous unrated cells, and possibly other cells holding the same value, and this property camouflages such entries.

Since privacy and efficiency are conflicting goals, privacy-preservation mechanisms bring additional communication, computation and storage requirements. The parties agreeing on our PPCF schemes should take into account about two communications per protocol. Computational overheads are dominated by homomorphic encryption. To improve efficiency, some optimizations may be possible, such as precomputing normalizations, similarity values and predictions before a's request. However, the parties must be ready for additional storage overheads in this case. For example, space for n^2/2 floating-point numbers will be required.

B. Simulation Results

In our experiments, we use the MLP dataset, which contains ratings from 943 users for 1682 movies. It was collected by the GroupLens research community and is publicly available at their web site, www.grouplens.org. There are 100,000 integer ratings in total from the range [1, 5]. We divide the available ratings randomly into train and test subsets containing 90% and 10% of the ratings, respectively. Ratings in the train subset are used to run the CF algorithm and generate predictions, while the actual rating values in the test subset are compared with the predicted values to assess prediction quality in terms of accuracy. To evaluate accuracy, we use the mean absolute error (MAE), which is widely used in CF research [2]. MAE equals the average of the absolute differences between the predicted values and the actual test ratings. To obtain dependable results, we perform 100 trials for each experiment, and in each trial the train and test ratings are determined randomly. Each displayed MAE value is the average of the MAEs obtained from all trials of the experiment.
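The MAE definition above translates directly into code. The function name and the sample values below are hypothetical and serve only to illustrate the formula.

```python
def mean_absolute_error(predicted, actual):
    """Average of absolute differences between predictions and test ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions against three test ratings.
mae = mean_absolute_error([4.2, 3.0, 1.5], [4, 3, 2])
```

Averaging such values over repeated random train/test splits gives the kind of per-experiment MAE reported in this section.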

First of all, we want to observe how the ratio of overlaps changes with varying density of the rating data and level of filling. We perform trials by increasing the density from 10 to 100 and the level of filling from 0 to 100, and report the percentages of overlaps in Table I. These percentages are the number of overlaps divided by the cardinality of the union of the ratings of both parties. When the data type is whole, all 100,000 available ratings are taken into account and the ratings are selected from them randomly. Otherwise, the ratings are drawn from the train data consisting of 90,000 ratings. Note that when the level of filling is 0 there is no filling, and when it is 100 there may be as many default votes as actual ratings. According to Table I, the overlap ratio increases with increasing density in all rows. However, the ratio decreases as the level of filling increases, since rating values can come only from the fixed 90,000 cells while vds can be assigned to the remaining cells, i.e. 1,496,126 cells.

TABLE I. RATIO OF OVERLAPS (%) VS. DENSITY AND FILLING LEVEL

Data Type   Filling   Density
                      10      20      40      60      80      100
Whole       0         5.28    11.11   24.99   42.86   66.67   100.00
Train       0         4.70    9.93    21.95   36.97   56.26   81.82
Train       10        4.47    9.40    20.67   34.72   52.24   75.29
Train       20        4.29    8.93    19.71   32.29   48.32   71.34
Train       40        3.94    8.11    18.08   29.69   43.22   61.87
Train       60        3.64    7.57    16.65   26.83   40.37   55.70
Train       80        3.49    7.13    15.50   24.88   36.22   49.89
Train       100       3.18    6.84    14.65   22.86   34.64   48.59
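The unfilled rows of Table I can be reproduced approximately by a simple expectation. The sketch below assumes that each party keeps each rating independently with the same probability p, so that the expected overlap over the union is p^2 / (2p - p^2) = p / (2 - p). Taking p as the density for the Whole row and as 0.9 times the density for the Train row is an observation that matches the reported numbers, not a statement made in the text; the function name is hypothetical.

```python
def expected_overlap_pct(density_pct, pool_fraction=1.0):
    """Expected |A and B| / |A or B| in percent under independent random selection."""
    p = density_pct / 100 * pool_fraction   # per-rating selection probability
    return 100 * p / (2 - p)

whole_80 = expected_overlap_pct(80)          # compare with 66.67 in Table I
train_60 = expected_overlap_pct(60, 0.9)     # compare with 36.97 in Table I
train_100 = expected_overlap_pct(100, 0.9)   # compare with 81.82 in Table I
```

Rows with a non-zero level of filling are not covered by this expression, since default votes enlarge the union without adding real overlapped ratings in the same proportion.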

In the second experiment, we examine how accuracy changes with different levels of filling. We vary the level of filling from 10 to 100 and compute MAE values for PS and US at each level. Following the analysis in Section V, we select column mean as vd. We set the density to 60, so each party holds 60% of the ratings, randomly selected from the train subset, and 36.97% of them are expected to be overlapped according to Table I. The MAEs of PS and US with respect to the varying level of filling are given in Fig. 3. As seen in Fig. 3, the two schemes show different accuracy characteristics as the level of filling increases. While the accuracy of PS worsens at high levels of filling, that of US improves only insignificantly, and US has the lowest MAEs at all levels of filling. The figure shows that the level of filling does not affect the accuracy of US as much as that of PS, since US gets rid of vds through the smoothing process. The best MAEs are 0.7513 for PS (at a level of filling of 20) and 0.7442 for US (at a level of filling of 60).

Figure 3. Accuracy with respect to varying level of filling

In the context of our study, a party can produce predictions using three different methods: on its own without collaboration, or through our schemes PS and US. In this set of experiments, we consider these three methods and compute MAEs for densities varying from 10 to 80. Based on Fig. 3, we set the filling level to its optimum values, 20 for PS and 60 for US. Table II displays the outcomes and the corresponding gains of the PPCF schemes over single-party CF, in percentages, where Gain(X) = 100 × (MAE_Single - MAE_X) / MAE_Single and MAE_X stands for the MAE value obtained with method X. According to Table II, the gains due to the PPCF schemes grow as density decreases. Hence, the proposed schemes work well for parties holding fewer ratings. This agrees with our motivation, which is to improve the prediction quality of parties that suffer from data scarcity. We also check the statistical significance of the results. For example, at density 20 the t-values are 47.60 and 31.14 for PS and US, respectively. For both t-values, the two-tailed P value is less than 0.0001, so by conventional criteria the differences between Single Party and each of the PPCF schemes are extremely statistically significant. The other t-values provide the same confidence level for the accuracies of our schemes, except for PS at density 60 and US at density 80. For PS at density 60, the t-value is less than 1, so the difference is not statistically significant. For US at density 80, the t-value equals 2.96, which gives a two-tailed P value of 0.0035; by conventional criteria, the difference due to US is very statistically significant.

TABLE II. OVERALL PERFORMANCE WITH VARYING DENSITY

Method / Density    10       20       40       60       80
Single Party        0.9265   0.8381   0.7729   0.7517   0.7416
Plain S. (PS)       0.8627   0.7936   0.7624   0.7513   0.7457
Ultimate S. (US)    0.8798   0.8003   0.7562   0.7443   0.7391
Gain (PS) (%)       6.88     5.31     1.35     0.06     -0.54
Gain (US) (%)       5.04     4.51     2.16     0.99     0.34
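The gain rows of Table II follow directly from the Gain(X) definition given in the text. The sketch below recomputes them from the MAE values printed in the table; the variable and function names are illustrative. Because the printed MAEs are rounded to four decimals, the recomputed gains may differ from the table in the last digit.

```python
def gain(mae_single, mae_scheme):
    """Gain(X) = 100 * (MAE_Single - MAE_X) / MAE_Single, in percent.

    Positive values mean the scheme is more accurate than the single party.
    """
    return 100.0 * (mae_single - mae_scheme) / mae_single


# MAE values copied from Table II, indexed by density.
densities = [10, 20, 40, 60, 80]
mae_single = [0.9265, 0.8381, 0.7729, 0.7517, 0.7416]
mae_ps = [0.8627, 0.7936, 0.7624, 0.7513, 0.7457]
mae_us = [0.8798, 0.8003, 0.7562, 0.7443, 0.7391]

gains_ps = [gain(s, x) for s, x in zip(mae_single, mae_ps)]
gains_us = [gain(s, x) for s, x in zip(mae_single, mae_us)]

if __name__ == "__main__":
    for d, g_ps, g_us in zip(densities, gains_ps, gains_us):
        print(f"density {d}: Gain(PS) = {g_ps:.2f}%, Gain(US) = {g_us:.2f}%")
```

A negative gain, as for PS at density 80, indicates that the scheme's MAE is higher than that of the single party at that density.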

VI. CONCLUSIONS AND FUTURE WORK

In this work, we introduce the problem of rating overlaps between two parties in the context of PPCF. We investigate the problem and propose two novel schemes. PS gives the de facto solution, involving the main PPCF process blocks without considering rating overlaps, whereas US consists of the same blocks plus an overlap-removal process. The empirical results show that our schemes improve the prediction quality of the parties while ensuring their privacy. Because they can operate on overlapped ratings, our schemes offer a more practical setup than existing PPCF solutions.

We introduce overlapped ratings in partitioned data, and our study is based on a conventional CF method. However, there are CF methods that enhance accuracy and efficiency. As future work, the problem can also be studied with such improved CF algorithms designed for particular features. We simplify the problem by equalizing the overlapping entries; in practice, however, much more complex overlapping cases could arise. It is worth examining such cases in a privacy-preserving manner, too.

REFERENCES

[1] A. Koschmider, T. Hornung, and A. Oberweis, “Recommendation-based editor for business process modeling,” Data & Knowledge Engineering, vol. 70(6), June 2011, pp. 483-503, doi: 10.1016/j.datak.2011.02.002.

[2] J.L. Herlocker, J.A. Konstan, A. Borchers, and J.T. Riedl, “An algorithmic framework for performing collaborative filtering,” Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, Berkeley, CA, USA.

[3] J.A. Calandrino, A. Kilzer, A. Narayanan, E.W. Felten, and V. Shmatikov, “‘You Might Also Like:’ Privacy Risks of Collaborative Filtering,” Proc. 2011 IEEE Symposium on Security and Privacy (SP), 2011, Oakland, CA, USA.

[4] OECD, “Guidelines on the Protection of Privacy and Transborder Flows of Personal Data,” 2005.

[5] H. Polat, and W. Du, “Privacy-Preserving Top-N Recommendation on Horizontally Partitioned Data,” Proc. 2005 IEEE/WIC/ACM International Conference on Web Intelligence, 2005, Compiegne, France.

[6] I. Yakut and H. Polat, “Estimating NBC-based recommendations on arbitrarily partitioned data with privacy,” Knowledge-based Systems, vol. 36, December 2012, pp. 353-362, doi: 10.1016/j.knosys.2012.07.015.

[7] M. Kantarcioglu and C. Clifton, “Privacy-preserving distributed mining of association rules on horizontally partitioned data,” IEEE Transactions on Knowledge and Data Engineering, 2004, vol. 16(9), pp. 1026-1037.

[8] J. Vaidya, C. Clifton, M. Kantarcioglu, and A.S. Patterson, “Privacy-preserving decision trees over vertically partitioned data,” ACM Transactions on Knowledge Discovery from Data, October 2008, vol. 2(3), pp. 1-27, doi: 10.1145/1409620.1409624.

[9] G. Jagannathan and R.N. Wright, “Privacy-preserving distributed k-means clustering over arbitrarily partitioned data,” Proc. 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, Chicago, Illinois, USA.

[10] N. Lathia, S. Hailes, and L. Capra, “Private distributed collaborative filtering using estimated concordance measures,” Proc. 1st ACM Conference on Recommender Systems (RecSys '07), 2007, Minneapolis, MN, USA.

[11] W. Ahmad and A. Khokhar, “An Architecture for Privacy Preserving Collaborative Filtering on Web Portals,” Proc. 3rd International Symposium on Information Assurance and Security, 2007, Manchester, UK.

[12] C. Kaleli and H. Polat, “Providing Naïve Bayesian Classifier-based Private Recommendations on Partitioned Data,” Lecture Notes in Computer Science, September 2007, vol. 4702, pp. 515-522, doi: 10.1007/978-3-540-74976-9_53.

[13] I. Yakut and H. Polat, “Privacy-Preserving SVD-Based Collaborative Filtering on Partitioned Data,” International Journal of Information Technology and Decision Making, May 2010, vol. 9(3), pp. 473-502, doi: 10.1142/S0219622010003919.

[14] I. Yakut and H. Polat, “Arbitrarily distributed data-based recommendations with privacy,” Data & Knowledge Engineering, February 2012, vol. 72, pp. 239-256, doi:10.1016/j.datak.2011.11.002.

[15] P. Paillier, “Public key cryptosystems based on composite degree residuosity classes,” Lecture Notes in Computer Science, May 1999, vol. 1592, pp. 223-238.

[16] M. Freedman, K. Nissim, and B. Pinkas, “Efficient Private Matching and Set Intersection,” Lecture Notes in Computer Science, 2004, vol. 3027, pp. 1-19, doi: 10.1007/978-3-540-24676-3_1.
