
DOI 10.1007/s11280-013-0242-4

Robust evaluation of products and reviewers in social rating systems

Mohammad Allahbakhsh · Aleksandar Ignjatovic · Hamid Reza Motahari-Nezhad · Boualem Benatallah

Received: 21 January 2013 / Revised: 6 June 2013 / Accepted: 8 July 2013
© Springer Science+Business Media New York 2013

Abstract Social rating systems are widely used to harvest user feedback and to support decision making by users on the Web. Web users may try to exploit such systems by posting unfair or false evaluations for fame or profit reasons. Detecting the real rating scores of products as well as the trustworthiness of reviewers is an important and very challenging problem. Existing approaches use majority-based methods along with temporal analysis and clustering techniques to tackle this problem, but they are vulnerable to massive intelligent collaborative attacks. In this paper, we propose a set of novel algorithms for robust computation of product rating scores and reviewer trust ranks. We introduce a supporting framework consisting of three main components responsible for calculating a robust rating score for products, behavior analysis of reviewers, and trust computation for reviewers. We propose a novel algorithm for calculating robust rating scores for products in the presence of unfair reviews. We introduce a method to analyze the reviewing behavior of users by building a vector reflecting three important aspects of reviewers' behavior. Finally, we combine these behavior factors using a fuzzy inference method to arrive at a final trust score for every reviewer. Extensive evaluation results show the accuracy of our calculated rating and trust scores as well as the robustness of our methods against collusive attacks.

M. Allahbakhsh (B) · A. Ignjatovic · H. R. Motahari-Nezhad · B. Benatallah
School of Computer Science and Engineering, UNSW, Sydney, NSW 2052, Australia
e-mail: [email protected]

A. Ignjatovic
e-mail: [email protected]

H. R. Motahari-Nezhad
e-mail: [email protected]

B. Benatallah
e-mail: [email protected]

World Wide Web (2015) 18:73–109

Published online: 27 July 2013

Keywords Product rating · Reviewer trust · Iterative voting · Collusive attacks

    1 Introduction

Web 2.0 technologies foster people's contribution of content to the Web, and their involvement in doing tasks and solving problems [14, 34]. People can collaborate in generating content, participate in social networks, accomplish jobs, answer questions, etc. [14]. An important class of user-generated content is users' feedback in terms of evaluation of items such as products, services and other content offered on the Web (we use the term 'product' to refer to all types of items). Thousands of web sites today solicit feedback from their users on their products. These users may have different levels of skills and expertise, but they may also be biased in their interests and incentives [36, 37]. For practical reasons, it is not commonplace to verify the identity of users leaving reviews of products, and it is even more difficult to determine their genuine intent and incentives when entering product evaluations. Similarly, it is quite difficult for another user, looking to buy a product, for example, to assess the quality of reviews left on products by a particular user or a set of users on the Web.

Nevertheless, the evaluation results (a.k.a. community judgement) play a major role in the quality assessment and decision making of other users, and on their trust in people who offer their services on the Web [16]. The community judgement on items is usually represented as an aggregation of feedback received from community members on the quality of a product or person. The process of collecting and aggregating such feedback takes place through social rating mechanisms [15]. Based on received feedback, social rating services calculate a rating score which reflects the overall quality of a product or a person from a community point of view. We use the term trust to refer to the community judgement on the quality of a person and the term rating score to represent the overall quality of a product from the community point of view. Examples of social rating services include Yelp1 in which people rate local businesses, IMDb2 in which movies are rated, and the Amazon online market3

which takes customers' reviews on the products they purchase.

    1.1 Problem statement

One important issue in existing rating systems which rely on taking feedback directly is identifying and dealing with unfair evaluations, i.e., evaluations that do not reflect the actual quality of the product or the experience of the user in using the product. Such evaluations might be entered due to misunderstanding, lack of expertise, dishonesty and bias of the evaluator. Evidence from several empirical studies confirms that users may try to manipulate the ratings of products by casting unfair evaluations [18, 30, 33, 42]. For example, people may attempt to promote the products of their interest by posting or buying supporting feedback, or try to demote their competitors by casting negative feedback on their products [18, 30].

1 http://www.yelp.com
2 http://www.imdb.com
3 http://www.amazon.com


Unfair evaluations can be cast individually or collaboratively [39, 40]. To deal with individual unfair evaluations, feedback is usually weighted by the trustworthiness of its caster [19, 35]. This is because it is believed that ratings cast by trustworthy reviewers should have a higher impact on the rating score of a product than ratings cast by untrustworthy users. A low trust value usually means that the user is not expected to produce high-quality reviews [5], and therefore her ratings are assigned a lower weight. Currently, the trust of reviewers is mainly computed based on the previous behavior of the reviewer in the system as well as the direct or indirect feedback of other members [3, 10, 31]. These approaches use local metrics (as opposed to system-wide, global metrics) in computing trust, which are easier to manipulate by ill-intentioned users. In addition, such users may not have a rich history with the system, or no feedback from other members may be available on their input. Therefore, these approaches for computing user trust are vulnerable to manipulation.

Another issue is that the existing rating techniques are not robust enough to detect collusion attacks such as the formation of collusion groups [33, 39] or the acquisition of fake trust ranks [39, 40]. In collusion detection, it is also necessary to spot unfair evaluations. Existing collusion detection techniques rely either on temporal and behavioral indicators to identify colluding groups [28, 33] or on continuous monitoring of the behavior of reviewers, looking for unusual behavior [28, 39]. Techniques for identifying colluding groups rely on searching for cliques in graphs representing reviews, reviewers, and the relationships among them [33]. This problem is NP-hard, leading to approximate and mostly non-optimal solutions [1]. Tuning parameters for alarming indicators or continuous monitoring systems for detecting collusion attacks, to achieve a relatively low number of false alarms, can be a daunting task [33, 43, 44]. Preparing a suitable training dataset that covers most cases of such attacks is yet another serious challenge [33]. Moreover, existing collusion detection systems use local quality metrics, i.e., metrics which are calculated from a small subset of data, e.g., reviews cast by a single reviewer or reviews cast on a subset of products. These metrics are more vulnerable to manipulation. The more global the metric, the more effort is needed to manipulate it. In order to manipulate a quality metric which is calculated based on the behavior of the entire community, one must change the sentiment of the community; for online communities with millions of users this might be an intractable task. For example, the dependence of the PageRank of any particular web page on the PageRanks of virtually all other web pages on the Internet is what makes PageRank robust against manipulation [35]. Therefore, existing methods for detecting collusion attacks are prone to organized manipulation by such users.

Yet another shortcoming of existing methods for social rating services is ignoring the helpfulness of reviews as a factor in computing evaluators' trustworthiness. Helpfulness of a review shows to what extent it can help others to make decisions such as buying a product, using a service, etc. [17]. Helpfulness can be assessed by other community members. When a reviewer posts a review, for instance in Amazon, others can cast their opinions on how useful this review is (see Figure 2). Also, helpfulness can be extracted from reviews using opinion mining techniques [9, 17]. We believe that the impact of a rating with higher helpfulness should be higher than that of a rating with lower helpfulness. On the other hand, the helpfulness of reviews cast by trustworthy users should weigh more than that of reviews cast by potentially malicious reviewers. Incorporating the helpfulness of reviews when calculating the rating score of products is another less studied issue in the area of social rating systems.


Figure 1 The architecture of our proposed framework

    1.2 Contributions

In this paper we propose a set of novel algorithms for robust computation of product rating scores and reviewer trust ranks. We introduce a supporting framework consisting of three main components: product rating computation, reviewers' behavior analysis, and reviewers' trust computation. The overall architecture of our proposed framework is depicted in Figure 1.

Product rating computation (cf. Figure 1a) relies on aggregated community sentiment to assess the quality of product reviews. Two main factors are used to determine the community sentiment: review helpfulness and the degree of consensus among reviewers. Building on the rating computation approach presented in [2], we have leveraged these two factors and proposed a novel algorithm for the robust computation of product ratings. The reviewer consensus factor in the product rating computation approach supports giving more importance to the community consensus on the rating of a product, and giving less importance to outlier ratings.

Reviewers' behavior analysis The second main contribution of this paper is the computation of a behavior vector for every reviewer to identify different aspects of the behavior of the member in the community while evaluating products (Figure 1b). The product rating step derives a conformance degree for every reviewer which reflects how much the ratings of the reviewer conform to the community consensus ratings; it is computed based on the aggregation of her ratings' conformance to the community over all the reviews that she has contributed. We identify seven behavioral factors, each of which reveals a specific aspect of the behavior of the reviewer, and use them towards evaluating the behavior of reviewers. These factors, which are computed over all the reviews the user has contributed, are: (1) Conformance to Community Consensus, (2) Helpfulness of Reviews, (3) Level of Contribution, (4) Absolute Error Rate, (5) Cumulative Error Rate, (6) Activity Time Window and (7) Evaluating Rate.

Reviewers' Trust Rank The third contribution of this paper is a novel method for calculating the trust ranks of reviewers. Our trust computation approach is illustrated in Figure 1c.


It uses the behavior vector of each reviewer to calculate a comprehensive trust rank for her. Both trust and behavior factors are imprecise cognitive concepts which have levels of uncertainty in their nature. Whether a trust rank or behavior factor is deemed low or high depends strongly on the context of the application. We use fuzzy logic [32] to map the behavior vector of reviewers to their trust ranks in a way that also accounts for their uncertainty and imprecision. There are two advantages to using fuzzy logic in trust management: (1) Fuzzy inference is capable of quantifying imprecise, uncertain information about the behavior of a person, like trust, conformance to community, etc., which is usually expressed using linguistic terms such as high or low rather than numeric values (see Section 5). (2) The ability to define different membership functions and inference rules makes it easier to customize the trust management method for various applications without changing the trust management engine [38].

Note that the presented algorithms address the problems stated in Section 1.1 by taking into account global metrics (the consensus of the community), as well as additional metrics about the cast reviews (e.g., helpfulness and other reviewers' behavioral aspects), in calculating the product ratings and reviewers' trustworthiness. The product rating computation algorithm that we present here is an extension of our model proposed in [2]. The first extension employs the helpfulness factor of reviews on the product. In [2], only the ratings were taken into account, but here we also consider how helpful the cast ratings are. This extension leads to a major change in the results of computing the credibility of quality levels and the trustworthiness of voters. The second difference is that in previous work we used the degree of conformance of the reviewer to community consensus as the level of her trustworthiness, but in this work we use it as just one of seven behavior factors employed in building trustworthiness.

Evaluation We have evaluated the performance of our algorithms by applying them to both a synthetic dataset and datasets from the Amazon online market and the MovieLens movie rating system. The results show the accuracy of our calculated scores and the robustness of our method against massive, manually injected malicious activities.

    1.3 Organization

In Section 2 we describe an example scenario which motivates the problem and the contributions. We present our rating algorithm in Section 3, our behavior analysis method in Section 4 and our trust calculation method in Section 5. The final rating scores of products are computed in Section 6. Section 7 describes the evaluation results. We discuss related work in Section 8 and conclude in Section 9.

    2 Motivating scenario

Online rating and evaluation systems cover a very broad category of systems, ranging from product and content rating systems to users' trustworthiness assessment [15]. We describe a use case scenario from a well-known evaluation system, the Amazon online marketplace,4 which we use throughout the paper to illustrate how our method works and also in its evaluation.


Figure 2 Helpfulness of reviews in Amazon online market

Amazon is a well-known online market in which sellers put products up for sale. Buyers search for relevant products and make buying decisions based on criteria like quality and price. Buyers share their experiences, including reviews and rating scores. Rating scores are numbers in the range one to five. Amazon generates an overall rating score for every product based on the rating scores cast by the buyers. Studies (e.g., those presented in [18]) show that the Amazon rating system has been subject to widespread collusion and unfair ratings. Amazon also allows people to cast votes on the helpfulness of reviews posted by others. As shown in Figure 2, a summary of people's opinions is reflected in terms of the number of people who found the review to be helpful. In the remainder of this paper, we will use this number as the level of helpfulness of a posted review. It should be noted, however, that, as reported in [9], sometimes this metric may not conform to the real usefulness of the review. Our proposed techniques can make use of other helpfulness metrics, too.

We use the log of the Amazon social rating system5 which was collected by Leskovec et al. for analyzing the dynamics of viral marketing [24]. We refer to this log as AMZLog in this paper. This log contains more than 7 million ratings cast on the quality of more than 500,000 products, collected in the summer of 2006. We will use AMZLog to show how the helpfulness of reviews, when taken into account, can result in more informative and robust rating scores and trust ranks. We will do more analysis on this dataset in Section 7.2.2.

    3 Product rating method

In this section, we first discuss why and how we reduce the rating process to a voting problem. Then we describe the setup of our model, basic concepts and notations. Later we present an iterative voting algorithm and show how we use the results of the algorithm to compute product rating scores. Finally, we also discuss the time complexity of the proposed algorithm.

4 http://www.amazon.com
5 http://snap.stanford.edu/data/amazon-meta.html


3.1 Reducing rating to voting

An election is the process of choosing one item from a finite list of items. More precisely, four basic factors need to be identified in every election. The first factor is the position for which the election is going to choose an occupant. The second factor is the set of candidates nominated for the target position. The next factor is the set of participants who contribute to the election by voting for their favorite candidate. The last factor is the voting aggregation mechanism which specifies how votes should be aggregated towards choosing the winner of the election, e.g., majority consensus or averaging.

Our current work is built on the basis of 'reducing rating to voting', which we proposed in [2]. In rating systems, we have a list of products to be rated by a group of users. Every user can express her opinion on the quality of a product by choosing the quality level which best describes her opinion. Sometimes it is also possible to extract a helpfulness level for every cast rating, explicitly or implicitly, from community feedback [9, 17]. In most rating systems the rating score assigned to a product is a (possibly weighted) average of the quality levels that raters have chosen.

In the 'reducing rating to voting' model, we look at the rating process as an election. In such an election, the voters are the people who are rating the product. The candidates, for which we later use the technical term items, are all possible values which a voter can choose to express her opinion about the product. These possible values are usually provided in the form of classic Likert-type scales [27], e.g., one star to show low quality or five stars to show high quality. Since each of these choices reflects a particular level of quality, we call them quality levels. The winner of the election will reflect the prevailing opinion of the community on the quality of the product. Our voting model calculates for each quality level in each election list a degree of credibility. Using such calculated degrees, we assign each voter a score reflecting to what extent she has behaved in accordance with the community sentiment. We call this score the degree of conformance to the community. In the next round of iteration we employ these conformance degrees to recalculate credibility degrees for each quality level, and so on. This interdependency between the calculation of credibility degrees and conformance to community is the reason that we propose an iterative rating model.

But why do we reduce rating to voting? The main idea in most existing iterative rating models such as [22, 45] and [11–13] is to produce at each stage of iteration an approximation of the final ratings of the products and then calculate for each rater the degree of her belief divergence from such calculated approximations, i.e., calculate some distance measure between her proposed rating scores and these approximations of the final scores of products. In the subsequent round of iteration a new approximation of the rating scores of all products is obtained as a weighted average of the scores proposed by raters, with the weight given to each rater's score inversely related to her corresponding distance from the approximate final scores obtained in the previous round of iteration. Thus, in this manner, the rating scores of products are produced simultaneously with an assessment of the trustworthiness of the raters, as reflected in the weights given to their proposed scores.

This model of iteration is prone to manipulation. Assume that one of the participants in the rating process is aware of, or can guess, the process of calculating the approximated rating scores.


She can cast a rating score very close to (or exactly the same as) the approximated score. In this case, since the distance of her posted rating score from the approximated score is very small, within a few iterations her opinion will dominate the others and the model will converge to her opinion.6 To solve this problem, we entirely decouple the credibility assessment from the score aggregation by using the voting model. We do not rely on any form of approximation. We first calculate a credibility degree for every quality level. Thereafter the user is free to aggregate these credibility degrees in any form to build final scores. In this model, first, it is not possible to cast a score that dominates others, and second, the model is more flexible, by allowing users to choose the aggregation model.

    3.2 Model setup and basic notation

Assume that $N$ voters $V_1, \ldots, V_N$ in a voting system participate in $L$ elections $\Pi_1, \ldots, \Pi_L$. In every election there exist $n_l$ quality levels, $\Pi_l = \{I_{l1}, \ldots, I_{l n_l}\}$. Voters are asked to choose the best quality level on each list. Also assume that there exists a helpfulness level corresponding to every cast vote, denoted by $\kappa_{rl}$, which is the helpfulness of the vote cast by voter $r$ on list $l$. A voter is not obliged to vote for the best quality level on every list, but can choose to vote on a subset of these lists. As we explained earlier, every election represents one product, so we may use the words list, election and product interchangeably. Also, every item represents a quality level that can be assigned to a product regarding its quality, so we may use the terms item, rating level and quality level as synonyms.

Based on voters' choices in elections, we build a degree of conformance to community sentiment, referred to in the following as conformance, for every voter. This value reflects to what extent the user's votes have conformed to the community sentiment. We denote the conformance of voter $V_r$ by $T_r$. We also reflect how the voters in every list agree on a particular item by computing the credibility degree of the item. The credibility degree of item $I_i$ on list $\Pi_l$ is denoted by $\rho_{li}$. To calculate conformance and credibility degrees, we define two new concepts: (i) benefit and (ii) gain.

Benefit When somebody chooses a quality level to reflect her opinion on the quality of a product, this vote will impact the credibility degree of that quality level. For example, in a normal election, it increases the chance of that quality level winning the election by one vote. In our model this impact is reflected as an increment to the credibility degree of the quality level. The amount of this increment depends on the degree of conformance of the voter to the community consensus as well as the helpfulness of the vote itself. We call this increment the benefit which the quality level takes from cast votes and denote it by $\beta$. $\beta_{li}$ is the benefit of item $I_i$ on list $\Pi_l$ from all votes that have been cast on it. We show in Section 3.3 how we calculate $\beta_{li}$.

Gain On the other hand, we believe that reviewers should be rewarded based on how their votes conform to community sentiment and on the helpfulness of their cast rating scores. The gain parameter is a metric to show how a user gains when she casts ratings which are helpful and concordant with community sentiment. We use the concept of gain to simplify the understanding of our model, but when we compute conformance degrees, we do not define it explicitly; it is just reflected in the calculation formulas.

6 This attack scenario was designed by Mohsen Rezvani. His work has not been published yet.



    Notation In the following, let us introduce some more notation:

– $r \to li$ denotes the fact that voter $V_r$ has participated in voting for the best object on list $\Pi_l$ and has chosen item $I_i$;

– for each item $I_{li}$ on list $\Pi_l$ we will keep track of its credibility degree at step $p$ of the iteration, denoted by $\rho^{(p)}_{li}$;

– these individual credibility degrees $\rho^{(p)}_{li}$ will be collected into a single vector $\rho = \langle \rho_{li} : 1 \le l \le L,\ 1 \le i \le n_l \rangle$; thus, if we let $M = \sum_{1 \le l \le L} n_l$, then $\rho \in \mathbb{R}^M$;

– we define $(\rho)_l$ to be the projection $\langle \rho_{li} : 1 \le i \le n_l \rangle$ of $\rho$ onto the subspace corresponding to a single list $\Pi_l$;

– for each voter $V_r$ we will also keep track of her conformance $T^{(p)}_r$ at stage $p$ of the iteration;

– for each $k \ge 1$ we denote by $\|x\|_k$ the usual $k$-norm of the vector $x = \langle x_1, \ldots, x_n \rangle$, i.e.,

$$\|x\|_k = \left( \sum_{i=1}^{n} x_i^k \right)^{\frac{1}{k}}.$$

    3.3 Iterative voting algorithm

Our proposed algorithm is a weighted algorithm, i.e., the impact of a person's vote on the result of an election depends on the conformance of the voter to the community consensus and the helpfulness of the vote. We use the following remarkably old idea to build the degree of conformance of voters to community sentiment:

Judge not, that you be not judged. For with what judgment you judge, you will be judged; and with the measure you use, it will be measured back to you.7

The intuitive motivation behind our method is that people, when they evaluate others, implicitly evaluate themselves. If the community sentiment approves the choices a voter has made, the voter will be deemed more honest than users who are far away from the community sentiment. By community sentiment approval we mean the extent to which the other members of the community have cast votes similar to those of a particular voter.

Figure 3 shows a schematic view of our algorithm. The algorithm starts by initializing conformance degrees to 1. Then, the algorithm initializes credibility degrees for every election (in this case, every product) based on these initial conformance degrees. The next part is the iterative part of the algorithm. We first calculate the benefits of every item on every list and, using these benefits, we calculate the credibility degree of every item on each list.

    7The New Testament, Matthew 7:1–2


Figure 3 An overview of our iterative voting algorithm

In this stage we compare the newly calculated credibility degrees with the ones calculated in the previous iteration. If the amount of change is sufficiently small (i.e., they have converged) the iteration stops; otherwise the next iteration is initiated. Since the credibility degrees are interdependent with the degrees of conformance of their corresponding voters to community sentiment, we update the degree of conformance of voters based on the newly calculated credibility degrees of their choices and the helpfulness of their votes. Using the new conformance degrees we recalculate the benefits and continue this process until the credibility degrees stop changing significantly.

Algorithm 1 is our voting algorithm. When we say a review is, for example, 70 % helpful, it simply means that when making a decision one can rely on this review to 70 %; but what about the other 30 %? Since helpfulness shows the community opinion on how helpful the review is, we can simply rely on the other members of the community who have voted on this product. In other words, when a reviewer votes on a quality level and the level of helpfulness of the vote is identified to be $\kappa_{rl}$, the community relies on this reviewer with confidence $\kappa_{rl}$ and on the rest of the community who have evaluated this product with confidence $1 - \kappa_{rl}$. So, the quality level $I_i$ on the list $\Pi_l$ should benefit from all votes cast on $\Pi_l$, but this benefit must depend on its degree of credibility and the helpfulness of the votes. This fact is represented in the computation of credibility degrees in Algorithm 1. The first part of the formula, i.e., $\sum_{r : r \to li} \kappa_{rl} \left( T^{(p)}_r \right)^{\alpha}$, reflects the benefit of the quality levels from the votes cast on them, and the second part, that is $\sum_{r : r \to l} (1 - \kappa_{rl}) \left( \rho^{(p)}_{li} \right)^{m} \left( T^{(p)}_r \right)^{\alpha}$, shows how a quality level can benefit from the votes cast on other levels. By the same explanation, when we calculate the degree of conformance of a voter, we employ all credibility degrees.

In this algorithm, α and m are two constants which control the execution of the algorithm. The constant α, which we call the discrimination factor, is used to control to what degree outliers should be marginalised. The larger the value of α, the more users will be marginalised. A large value of α makes the algorithm more robust against outliers, but it also increases the possibility of marginalising honest users.


Algorithm 1 Adaptive voting algorithm

Initialization: Let $\varepsilon > 0$ be the precision threshold, $\alpha \ge 1$ a discrimination setting parameter, and

$$T^{(0)}_i = 1;$$

$$\rho^{(0)}_{li} = \frac{\sum_{r \,:\, r \to li} T^{(0)}_r \, \kappa_{rl}}{\sqrt{\sum_{1 \le j \le n_l} \left( \sum_{r \,:\, r \to lj} T^{(0)}_r \, \kappa_{rl} \right)^2}}.$$

Repeat:

$$\beta^{(p+1)}_{li} = \sum_{r \,:\, r \to li} \kappa_{rl} \left( T^{(p)}_r \right)^{\alpha} + \sum_{r \,:\, r \to l} (1 - \kappa_{rl}) \left( \rho^{(p)}_{li} \right)^{m} \left( T^{(p)}_r \right)^{\alpha};$$

$$\rho^{(p+1)}_{li} = \frac{\beta^{(p+1)}_{li}}{\sqrt{\sum_{1 \le j \le n_l} \left( \beta^{(p+1)}_{lj} \right)^2}};$$

$$T^{(p+1)}_r = \sum_{l,i \,:\, r \to li} \rho^{(p)}_{li} \, \kappa_{rl} + \sum_{l \,:\, r \to l} \left( \sum_{1 \le j \le n_l} \frac{1 - \kappa_{rl}}{m + 1} \left( \rho^{(p)}_{lj} \right)^{m+1} \right);$$

until: $\| \rho^{(p+1)} - \rho^{(p)} \|_2 < \varepsilon$.

The second constant, m, is used to control the distribution of the weight of the rater over all items in a list according to the level of helpfulness of her vote. In other words, the benefit of a quality level of a product comes from two sources. The first is the degree of conformance of the voters who have chosen this particular quality level; this part of the benefit depends on the helpfulness of the votes cast on the quality level. The second part of the benefit comes from the conformance degrees of all people who have voted on this list. This portion of the benefit is related to the level of non-helpfulness $(1 - \kappa_{rl})$ of the cast votes and the conformance level of their corresponding voters. The amount of this portion of the benefit also depends on the credibility degree of the item itself: the higher the credibility degree of a quality level, the larger its benefit from the community consensus. The constant m is a parameter that controls the amount of this benefit. The larger the value of m, the smaller the benefit of the items from the conformance degrees of voters who have voted on other items in the list. For the same reason, we control the gain of the users from the lists they voted on by choosing a suitable value for m. In our implementation, we use α = 2 and m = 3.

The convergence of the algorithm is guaranteed by the existence of a fixed point of a continuous mapping, which also happens to be a stationary point of a constrained optimization objective function. The proof of convergence of our iterative algorithm is presented in the Appendix.
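To make the update rules concrete, the following Python sketch implements the iteration of Algorithm 1 as reconstructed above. It is only an illustrative re-implementation, not the MATLAB implementation used in our experiments; the `votes` triples and the `kappa` dictionary are data structures assumed here for the example.

```python
import numpy as np

def iterative_voting(votes, kappa, n_levels, alpha=2.0, m=3.0, eps=1e-6, max_iter=100):
    """Sketch of the adaptive voting iteration (Algorithm 1).

    votes    : list of (r, l, i) triples -- voter r chose quality level i (0-based) on list l
    kappa    : dict mapping (r, l) -> helpfulness of the vote cast by r on list l
    n_levels : number of quality levels per list (assumed equal for all lists)
    Returns (rho, T): credibility degrees per (list, level) and conformance per voter.
    """
    voters = sorted({r for r, _, _ in votes})
    lists_ = sorted({l for _, l, _ in votes})
    T = {r: 1.0 for r in voters}                    # T^(0)_r = 1
    rho = np.zeros((len(lists_), n_levels))

    # Initial credibility degrees: helpfulness-weighted vote counts, normalised per list.
    for li, l in enumerate(lists_):
        for i in range(n_levels):
            rho[li, i] = sum(kappa[(r, l)] * T[r] for r, ll, ii in votes if ll == l and ii == i)
        rho[li] /= np.linalg.norm(rho[li]) or 1.0

    for _ in range(max_iter):
        beta = np.zeros_like(rho)
        for li, l in enumerate(lists_):
            votes_on_l = [(r, i) for r, ll, i in votes if ll == l]
            for i in range(n_levels):
                direct = sum(kappa[(r, l)] * T[r] ** alpha for r, ii in votes_on_l if ii == i)
                indirect = sum((1 - kappa[(r, l)]) * rho[li, i] ** m * T[r] ** alpha
                               for r, _ in votes_on_l)
                beta[li, i] = direct + indirect
        new_rho = beta / np.maximum(np.linalg.norm(beta, axis=1, keepdims=True), 1e-12)

        # Conformance update uses the credibility degrees of the previous round.
        new_T = {}
        for r in voters:
            own = sum(rho[lists_.index(l), i] * kappa[(r, l)] for rr, l, i in votes if rr == r)
            rest = sum((1 - kappa[(r, l)]) / (m + 1) * np.sum(rho[lists_.index(l)] ** (m + 1))
                       for rr, l, _ in votes if rr == r)
            new_T[r] = own + rest

        converged = np.linalg.norm(new_rho - rho) < eps
        rho, T = new_rho, new_T
        if converged:
            break
    return rho, T
```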


3.4 Calculating initial rating scores

Let $\pi_l$ represent the product corresponding to voting list $\Pi_l$, $1 \le l \le L$. Every voting list consists of a set of quality levels and their corresponding credibility degrees. We can now obtain a rating score $R(\Pi_l)$ of each product $\pi_l$, using such credibility degrees $(\rho)_l = \langle \rho_1, \ldots, \rho_n \rangle$, in a way which suits the particular application best. For example, if the rating scores have to reflect where the community sentiment is centered, we can simply choose as the rating score $R(\Pi_l)$ of $\Pi_l$ the rating level which has the highest credibility degree. Such a rating score does not involve any averaging and is most indicative of the community's prevailing sentiment. On the other hand, if we wish to obtain a score which emphasizes such prevailing sentiment but, to a varying degree, takes into account "dissenting views", one can form a weighted average of the form

$$R(\Pi_l) = \sum_{1 \le i \le n_l} \frac{\rho^q_{li} \times i}{\sum_{1 \le j \le n_l} \rho^q_{lj}},$$

where $q \ge 1$ is a parameter. As $q$ increases, such a rank converges to the previous one, determined by the maximum credibility degree, while for smaller values of $q$ we obtain a significant averaging effect. In our implementations and testing we have used $q = 2$.
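As a quick illustration of the weighted-average variant, the following sketch computes the score for one product from its credibility degrees (the example degrees are made up for the illustration):

```python
import numpy as np

def rating_score(rho_l, q=2.0):
    """Weighted rating score of one product from its per-level credibility degrees.

    rho_l : credibility degrees of quality levels 1..n_l (index 0 corresponds to level 1)
    q     : emphasis parameter; larger q pushes the score toward the most credible level
    """
    rho_q = np.asarray(rho_l, dtype=float) ** q
    levels = np.arange(1, len(rho_q) + 1)
    return float(np.sum(rho_q * levels) / np.sum(rho_q))

# Five quality levels whose credibility peaks at level 4.
print(rating_score([0.05, 0.10, 0.20, 0.70, 0.15], q=2.0))  # close to 4
```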

    3.5 Time complexity of the algorithm

Algorithm 1 has two main parts, one responsible for calculating credibility degrees and the other responsible for calculating conformance degrees. In terms of the number of iterations, the worst case in calculating $\beta_{li}$ happens when everybody has voted in every election; in that case the number of iterations will be twice the number of reviewers. Consequently, the time complexity of calculating the benefit of a single level on a single list is O(N). Moreover, this computation is repeated for every item of every list, so the time complexity becomes O(L × nl × N).

On the other hand, the number of levels on every list, i.e., nl, is very small (usually less than 10) in comparison with the number of products and users (millions of users and products), so we can approximate the time complexity with O(L × N). It is easy to show that the time complexity of the computation of conformance degrees is also O(L × N).

The algorithm will iterate until it converges, say k times. The value of k is very small in comparison with L, N or L × N; in our experiments the algorithm always converged in fewer than 100 iterations. Therefore, for large enough values of L and N we can ignore k and conclude that the time complexity of our algorithm is approximately O(L × N).

It is obvious that when everybody votes on every product, it is impossible to achieve a better time complexity. It is noteworthy that our algorithm, with this time complexity, calculates the degree of conformance of users to community consensus as well as the credibility degrees of list items while it is also marginalising colluders (see more details in Section 7). Most of the existing collusion detection techniques rely on clustering and clique detection, which are NP-hard problems, but our algorithm, with an acceptable time complexity, removes the impact of collusion and calculates trust ranks and credibility degrees simultaneously (see Section 7).


4 Behavior analysis

In the previous section we calculated a rating score for every product. Simultaneously, we calculated a rank for every reviewer reflecting to what degree the reviewer conforms to the community sentiment. While conformance to the community sentiment shows the weight of the votes of the reviewer, it is not suitable as the only indicator of the real nature of the behavior of the user. In other words, when a user has a low conformance to the community, it does not necessarily mean that she is an untrustworthy user: she may have cast her votes honestly, but they were different from the community consensus due to, for instance, lack of expertise. On the other hand, high conformance may be the result of trying to look trustworthy by casting votes close to the community consensus in order to build a fake trust rank.

In this section we introduce more indicators which can help us to evaluate the behavior of people more precisely. These factors altogether build a behavior vector for every reviewer. In the next section, we map this vector to a value which reflects how the reviewer has behaved in the system, i.e., the trustworthiness of the reviewer. The behavior of a reviewer in a rating system can be characterized by several factors, each of which reveals a particular aspect of the behavior. Below, we define these factors and show how we calculate them.

    4.1 Conformance to community consensus

This factor shows to what extent the behavior of a person conforms to the prevailing sentiment of the community. The degree of conformance was described in the previous sections and is calculated by our proposed voting algorithm in Section 3. The degree of conformance of reviewer $r$ can be any number greater than or equal to zero, so we map every $T_r$ to a number in the range $[0, 1]$ to make them comparable in all contexts. In order to map each $T_r$ to the range $[0, 1]$, we find the maximum value of $T_r$, $1 \le r \le N$, and then divide each $T_r$ by this maximum. In other words:

$$T_r = \frac{T_r}{\max_{1 \le i \le N} T_i}$$

We will reuse this method for mapping the values of other behavior factors to the range $[0, 1]$ in the next sections, when necessary. The resultant degree of conformance is a number between 0 and 1 which reflects, in relative terms, how small or large the degree of conformance is. The value 1 means the reviewer highly conforms to the community sentiment and the value 0 means no conformance.

    4.2 Level of contribution

A very important factor of the behavior of a reviewer in the system is the extent of her activity in the system, i.e., how many votes she has cast. The significance of the number of votes is revealed only when it is compared with the number of votes each other reviewer has cast. For instance, if a reviewer has cast 100 votes in a community in which all other voters cast less than 50 votes, this reviewer is a highly active member of the community. Alternatively, suppose that all other members of such a community have posted at least 500 votes; in this case this reviewer is a relatively inactive member.


To resolve this issue we assign every reviewer $r$ a level of contribution, a relative value denoted by $\varsigma_r$. Suppose that the number of reviews cast by a reviewer $r$ is denoted by $V_r$ and $N$ is the number of reviewers.

In Section 3, for every product $\Pi_l$ we calculated a rating score $R(\Pi_l)$ which reflects the community judgement of the quality of the product, and we use it as the true rating score of the product. Once this rating score is calculated, we can check how reviewers deviate from this truth level. Suppose that reviewer $r$ has evaluated product $\Pi_l$ with the rating score $v_{rl}$. The absolute value of $R(\Pi_l) - v_{rl}$ reflects the deviation of $r$ while evaluating $\Pi_l$. The sum of all such deviations constitutes an error rate for the reviewer reflecting her overall deviation from the community sentiment. We denote this deviation by $ABS(r)$ and calculate it as follows:

$$ABS(r) = \sum_{l \,:\, r \to l} \left| R(\Pi_l) - v_{rl} \right|$$

We map each $ABS(r)$ to the range $[0, 1]$, denote the result by $\theta_{abs}(r)$ and call it the absolute error rate. The higher the value of $\theta_{abs}(r)$, the lower the accuracy of the reviewer when evaluating products.

We also define and calculate another error rate for every reviewer. This error rate is the accumulation of her deviations from the real rating scores. The only difference with the absolute error rate is that we do not use the absolute value of the difference but use the signed values. Therefore, for reviewer $r$ we define $ACC(r)$ as follows:

$$ACC(r) = \sum_{l \,:\, r \to l} \left( R(\Pi_l) - v_{rl} \right)$$

We then project the value of this error rate to the range $[0, 1]$ to obtain a cumulative error rate, denoted by $\theta_{acc}(r)$. A small value of $\theta_{acc}(r)$ may mean that the reviewer has voted randomly, so that some of the differences are positive and some of them are negative and the result is small; the value can also be small because of the high accuracy of the reviewer. High values of $\theta_{acc}(r)$ mean that the reviewer is biased and has cast scores that are always lower (or always higher) than the real value of the product rating score. The value of $\theta_{acc}(r)$ is always smaller than or equal to $\theta_{abs}(r)$, i.e., $\theta_{acc}(r) \le \theta_{abs}(r)$.

We assign each reviewer an error rate which is the average of the absolute and the cumulative error rates. The error rate of reviewer $r$ is denoted by $\theta_r$ and calculated as follows:

$$\theta_r = \frac{\theta_{abs}(r) + \theta_{acc}(r)}{2}$$
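As an illustration, here is a minimal sketch of how the two error rates of a single reviewer could be computed from the rating scores of Section 3; the dictionary-based inputs are assumptions made for the example, not the data structures used in our implementation.

```python
import numpy as np

def error_rates(ratings, true_scores):
    """Raw ABS(r) and ACC(r) for one reviewer.

    ratings     : dict mapping product id -> rating v_rl cast by this reviewer
    true_scores : dict mapping product id -> computed rating score R(Pi_l)
    Normalising the returned values over all reviewers to [0, 1] (e.g. dividing by
    the maximum, as done for conformance) gives theta_abs and theta_acc.
    """
    diffs = np.array([true_scores[l] - v for l, v in ratings.items()])
    abs_r = float(np.sum(np.abs(diffs)))  # magnitudes of deviations
    acc_r = float(np.sum(diffs))          # signed deviations; they may cancel out
    return abs_r, acc_r

# A reviewer who always under-rates by one: both sums are large (bias), whereas
# alternating +1/-1 errors would give a large ABS(r) but a small ACC(r).
print(error_rates({"p1": 3, "p2": 4, "p3": 2}, {"p1": 4, "p2": 5, "p3": 3}))  # (3.0, 3.0)
```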

    5 Trust calculation method

Trust is a cognitive and subjective concept which is dependent on the context in which it is being used. Being trustworthy or not may have different meanings in different contexts. A person who has, say, a trust rank of 60 % might be considered trustworthy in a Q&A system, but in a strict financial system she might be treated as malicious because only people with trust values higher than, for instance, 85 % are considered trustworthy. This uncertainty is also present when dealing with the behavioral factors of a reviewer. Let us explain this with an example. When evaluating reviewers, if we spot a reviewer with a very low level of conformance to the community, we can classify her as a possible outlier.


Figure 4 Fuzzy inference system used for calculating Trust Scores

But the challenging question is: what is the real range of numbers which can be deemed small?

In reality, we generally do not use crisp numeric values to assess trustworthiness or other aspects of the behavior of a person, but we use linguistic terms like small and large, having a common understanding of their meanings. To build a realistic trust rank for reviewers we imitate this cognitive approach. We convert the numeric values we calculated for behavioral factors to linguistic terms and use them to reason about the trustworthiness of reviewers.

Our approach employs fuzzy logic to do these conversions and calculate a comprehensive trust rank for every reviewer. A fuzzy inference system (FIS) consists of several parts, illustrated in Figure 4. Explaining the details of fuzzy logic is outside the scope of this work; more details can be found in [32]. We provide some basic information about the different parts of a fuzzy inference system as we explain its application in calculating trust ranks.

    5.1 Fuzzifier

In a fuzzy inference system, we have a set of crisp input variables which should be processed. In our system these variables are the behavioral factors of reviewers. Corresponding to every input value we define a linguistic variable, usually with the same name as the input variable, e.g., helpfulness, Absolute_error_rate, etc. Every linguistic variable contains a set of linguistic values corresponding to every input variable, for example Helpfulness = {small, medium, high}. Every linguistic value is represented by a function called a membership function, denoted by μ. The output of a membership function is a real number in the range [0, 1] which reflects the level of membership of an input value in a linguistic value.

The fuzzifier uses these membership functions to convert crisp input variables to fuzzy linguistic variables and vice versa. In our model we have seven input variables, which are the behavioral factors of reviewers, and one output variable, which is the trust rank that should be calculated for every reviewer. We have defined the same membership function for all input variables, depicted in Figure 5a. The membership function of trust, as the output variable, is shown in Figure 5b.

One important characteristic of fuzzy logic which fits the uncertain nature of trust and behavioral factors is that an input value does not have to be fuzzified using only one function. For example, the value 0.26 for helpfulness can be considered both small and medium at the same time, but with different degrees of membership.


Figure 5 Membership functions of input and output linguistic variables

    5.2 Inference engine

The role of the inference engine is to convert fuzzy inputs to a fuzzy output. This conversion is done using a set of IF-THEN rules called fuzzy rules. A fuzzy rule specifies the conditions under which a set of fuzzy inputs is mapped to an output fuzzy variable. For example, a fuzzy rule can be as follows:

    IF degree_of_conformance IS high AND Error_Rate IS low THEN trust IS high.


This method of trust calculation is one of the strong points of our proposed method. As we explained earlier, the definition of honesty and dishonesty can be quite different in different contexts. In a system like a security system, the definition of trust is completely different from that in a general-purpose social network or a question-answering forum. Therefore, we do not define strict rules for trust computation. We enable system users to customize their trust definition by defining any number of fuzzy rules they need, with any combination of behavioral factors. These fuzzy rules are defined based on the intuition of trust in the target community. The following is an example showing how the concept of trust can lead to defining fuzzy rules.


Example Assume a reviewer who has a high conformance with the community. This reviewer also has a small error rate, and a large number of reviews have been cast by her. We can consider such a reviewer a person with a high level of trust. This can be reflected in a fuzzy rule as follows.

IF T IS high AND θ IS low AND ς IS high THEN τ IS high.

Since we have 3 input variables, each of which can take 3 different values, we have 3^3 = 27 different combinations. Each combination can potentially represent a particular level of trustworthiness. According to the specific application and the demanded level of accuracy, one can define a fuzzy rule for all or some of these combinations. We have defined 27 rules covering all possible combinations and list them in Table 1. Recall that these rules are given as sample rules and are completely customizable depending on the requirements of the application.


Table 1 The set of defined fuzzy rules in our fuzzy logic system (VL = Very Low, L = Low, M = Medium, H = High, VH = Very High)

Behavioral factors (input): T, ς, θ; trust value (output): τ

Rule no.  T   ς   θ   τ
1         L   L   L   L
2         L   L   M   L
3         L   L   H   VL
4         L   M   L   L
5         L   M   M   VL
6         L   M   H   VL
7         L   H   L   L
8         L   H   M   VL
9         L   H   H   VL
10        M   L   L   M
11        M   L   M   M
12        M   L   H   VL
13        M   M   L   M
14        M   M   M   M
15        M   M   H   L
16        M   H   L   M
17        M   H   M   L
18        M   H   H   VL
19        H   L   L   VH
20        H   L   M   VH
21        H   L   H   M
22        H   M   L   H
23        H   M   M   H
24        H   M   H   M
25        H   H   L   M
26        H   H   M   L
27        H   H   H   L



To arrive at the trust ranks, the inference engine should evaluate all these fuzzy rules, check whether they match the reviewer's behavior, and then compose the results of these evaluations to define output zones for the output variable trust. In the literature, different methods have been proposed for evaluating fuzzy rules [32]. Our inference engine employs the max-min composition method.

    5.3 Defuzzifier

When evaluating a reviewer in the inference step, based on the different values of the input fuzzy variables, several fuzzy rules might be activated at the same time, and the result is a set of linguistic output values which are included in the output with different levels of membership. For example, trust might be low with a membership of 0.2, medium with a membership of 0.4 and high with a membership of 0.7. The role of the defuzzifier is to combine these different levels of membership and build a crisp value reflecting the trustworthiness of the reviewer. Several methods have been proposed for this aggregation [41]. We employ the Center of Gravity (CoG) method, which calculates the center of gravity of all areas of the trust membership function that have been assigned to trust by the inference engine. CoG is probably the most popular aggregation method, and it is also quick and accurate in computations [41].

The output of the defuzzifier is the trust rank of each reviewer, which is a real number between 0 and 1. Values close to 1 represent users whose behavior shows high trustworthiness based on our designed model. Low trust values indicate users whom our reviewer evaluation method did not deem trustworthy.
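The sketch below illustrates the fuzzify/infer/defuzzify pipeline described above with a Mamdani-style engine in Python: triangular membership functions, max-min rule evaluation, and center-of-gravity defuzzification. The membership breakpoints are illustrative assumptions (not the exact functions of Figure 5), and only four of the 27 rules of Table 1 are included.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function over [0, 1]; a == b or b == c gives a shoulder."""
    x = np.asarray(x, dtype=float)
    left = np.ones_like(x) if b == a else np.clip((x - a) / (b - a), 0.0, 1.0)
    right = np.ones_like(x) if c == b else np.clip((c - x) / (c - b), 0.0, 1.0)
    return np.minimum(left, right)

# Assumed breakpoints; Figure 5 defines the actual membership functions.
IN_SETS = {"low": (0.0, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.0)}
OUT_SETS = {"very_low": (0.0, 0.0, 0.25), "low": (0.0, 0.25, 0.5), "medium": (0.25, 0.5, 0.75),
            "high": (0.5, 0.75, 1.0), "very_high": (0.75, 1.0, 1.0)}

# Four sample rules from Table 1, in the order (T, contribution, error rate) -> trust.
RULES = [(("high", "low", "low"), "very_high"),       # rule 19
         (("high", "medium", "low"), "high"),         # rule 22
         (("medium", "medium", "medium"), "medium"),  # rule 14
         (("low", "low", "high"), "very_low")]        # rule 3

def trust_rank(conformance, contribution, error_rate):
    """Max-min inference over the rules, then center-of-gravity defuzzification."""
    x = np.linspace(0.0, 1.0, 501)        # discretised universe of the output variable
    aggregated = np.zeros_like(x)
    inputs = (conformance, contribution, error_rate)
    for antecedents, consequent in RULES:
        # Firing strength: minimum membership of the three fuzzified inputs.
        strength = min(float(tri(v, *IN_SETS[lab])) for v, lab in zip(inputs, antecedents))
        # Clip the consequent set at the firing strength, aggregate with max.
        aggregated = np.maximum(aggregated, np.minimum(strength, tri(x, *OUT_SETS[consequent])))
    return float(np.sum(x * aggregated) / np.sum(aggregated)) if aggregated.any() else 0.0

print(trust_rank(conformance=0.9, contribution=0.2, error_rate=0.1))  # high trust, close to 1
```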

6 Calculating final product rating scores

In this section we use the trust ranks calculated by our trust calculation engine, proposed in the previous section, to compute the final rating scores of the products. To do so, we use the credibility degrees of quality levels calculated by our vote aggregation algorithm, along with the trust ranks of reviewers, and calculate the final rating scores of products in the following two steps.

Step 1 We recompute the credibility degrees of quality levels for each product as follows:

$$\beta^{(p+1)}_{li} = \sum_{r \,:\, r \to li} \kappa_{rl} \left( \tau^{(p)}_r \right)^{\alpha} + \sum_{r \,:\, r \to l} (1 - \kappa_{rl}) \left( \rho^{(p)}_{li} \right)^{m} \left( \tau^{(p)}_r \right)^{\alpha};$$

$$\rho^{(p+1)}_{li} = \frac{\beta^{(p+1)}_{li}}{\sqrt{\sum_{1 \le j \le n_l} \left( \beta^{(p+1)}_{lj} \right)^2}};$$


Step 2 In the second step, we use the newly computed credibility degrees to compute the rating scores of products as follows:

$$Rate(\Pi_l) = \sum_{1 \le i \le n_l} \frac{\rho^q_{li} \times i}{\sum_{1 \le j \le n_l} \rho^q_{lj}}.$$

We will show in the evaluation section that this substitution increases the accuracy of the calculated product rating scores.
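A minimal sketch of the two steps, assuming the trust ranks `tau` come from the fuzzy engine above and reusing the illustrative `rating_score` helper from Section 3.4 (all names are assumptions made for the example, not part of our implementation):

```python
import numpy as np

def final_credibility(votes_on_l, kappa_l, tau, rho_l, alpha=2.0, m=3.0):
    """Step 1 for a single product: recompute credibility degrees with trust ranks.

    votes_on_l : list of (voter, level_index) pairs for this product
    kappa_l    : dict voter -> helpfulness of that voter's vote on this product
    tau        : dict voter -> trust rank from the fuzzy inference engine
    rho_l      : credibility degrees of this product's levels from Algorithm 1
    """
    beta = np.zeros_like(rho_l)
    for i in range(len(rho_l)):
        direct = sum(kappa_l[r] * tau[r] ** alpha for r, lvl in votes_on_l if lvl == i)
        indirect = sum((1 - kappa_l[r]) * rho_l[i] ** m * tau[r] ** alpha for r, _ in votes_on_l)
        beta[i] = direct + indirect
    return beta / np.linalg.norm(beta)

# Step 2: aggregate the recomputed degrees into the final score, e.g.
# final_score = rating_score(final_credibility(votes_on_l, kappa_l, tau, rho_l), q=2.0)
```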

    7 Experimentation and evaluation

We now evaluate our proposed method. Our algorithm takes into account the helpfulness of the cast reviews, the values of the cast reviews and also the trustworthiness of the persons who have cast such reviews. Since none of the existing rating algorithms takes the helpfulness parameter into account, we perform our experimental evaluations in the following manner. We use our basic algorithm proposed in [2] to compare with other related work. This algorithm is shown as Algorithm 2. Our basic algorithm does not take helpfulness into account, so its comparison with other related work is fairer. In this part, we first conduct experiments in order to find optimal values for the algorithm's tuning parameters and then use these values in the algorithm for comparison purposes. In the next step of the evaluation we show how incorporating helpfulness improves the performance of our algorithm. Finally, in the third step of the evaluation, we assess our trust calculation engine and show how the accuracy of the calculated rating scores increases when the trust ranks are employed rather than the level of conformance to the community.

    7.1 Evaluating the basic algorithm

Our algorithm has two main parameters which help users to tune the algorithm to obtain a behavior best suited for a particular application: α and q. To be able to compare our algorithm with the related work from the literature, we conducted experiments in order to find optimal values for α and q and then use these values in the algorithm for comparison purposes. In the last part of this section, we evaluate the accuracy of our calculated rating scores using a real-world dataset.

    7.1.1 Datasets

We use a synthetic dataset for the purpose of analyzing the behavior of our algorithm with different values of α and q, and also for performance comparison. There are two reasons for using a synthetic dataset. First, we need ground truth levels for the purpose of accuracy measurement; none of the existing real-world datasets provide such ground truth levels. Second, we need to evaluate the performance of our algorithm in different scenarios, such as having evaluators with different error rates or in the presence of collusion attacks. Since we do not know the real behavior of users in real-world datasets, they are not suitable for conducting such experiments.


Algorithm 2 Basic vote aggregation algorithm (Proposed in [2])

Initialization: Let $\varepsilon > 0$ be the precision threshold, $\alpha \ge 1$ a discrimination setting parameter, and

$$T^{(0)}_i = 1; \quad (1)$$

$$\rho^{(0)}_{li} = \frac{\sum_{r \,:\, r \to li} T^{(0)}_r}{\sqrt{\sum_{1 \le j \le n_l} \left( \sum_{r \,:\, r \to lj} T^{(0)}_r \right)^2}} = \frac{|\{r : r \to li\}|}{\sqrt{\sum_{1 \le j \le n_l} |\{r : r \to lj\}|^2}}. \quad (2)$$

Repeat:

$$T^{(p+1)}_r = \sum_{l,i \,:\, r \to li} \rho^{(p)}_{li}; \quad (3)$$

$$\rho^{(p+1)}_{li} = \frac{\sum_{r \,:\, r \to li} \left( T^{(p+1)}_r \right)^{\alpha}}{\sqrt{\sum_{1 \le j \le n_l} \left( \sum_{r \,:\, r \to lj} \left( T^{(p+1)}_r \right)^{\alpha} \right)^2}}; \quad (4)$$

until: $\| \rho^{(p+1)} - \rho^{(p)} \|_2 < \varepsilon$.

We use the MovieLens 100k dataset8 as a guide to generate our synthetic datasets. More precisely, we extract the following statistical parameters from the MovieLens dataset: (i) the number of evaluators, (ii) the number of movies, (iii) the statistical distribution of the number of votes per evaluator, and (iv) the statistical distribution of the number of votes per movie. The dataset contains 943 evaluators and 1,682 movies. So, we generate a dataset with exactly the same number of evaluators and movies. We assign each movie a "true value" uniformly randomly selected from the range [1, 10], except for the last movie, which will be the target of a collusion attack; we let the last movie have a quality of 1 out of 10. Similarly, we assign each evaluator r a standard deviation $\sigma_r$ randomly and uniformly in the range [1, 5].

Using the standard statistical estimation tools provided by the Matlab® software package, we have determined that in the MovieLens dataset the distribution of the number of votes per movie fits a random variable $X = \mathrm{Beta}(0.5753, 8.4141)$, and the distribution of the number of votes per evaluator fits a random variable $Y = \mathrm{Beta}(1.3247, 19.5008)$. We generate the dataset so that each movie is rated by at least 20 evaluators and each evaluator has rated at least 20 movies.

8 http://www.grouplens.org/node/12


For each evaluator, we obtain the number of movies that she will rate as a value of the random variable Y, denoting it by $k_r$, and for each movie we randomly select the number of votes it can receive using the random variable X; the particular movies each voter will rate are selected randomly from such a pool of total available votes movies can receive. We assume that each evaluator r votes honestly but with an accuracy corresponding to her standard deviation $\sigma_r$. So, the rating score given by an evaluator to a movie is generated using a normal distribution with mean equal to the "true rating" of the movie and standard deviation $\sigma_r$. All cast evaluations are rounded to be in the range of 1 to 10.

We use this dataset as a base for generating collusion-included datasets. We assume that in a collusion attack, a group of Nc users collaborate to boost the rating score of the last movie from 1 to 10. Such colluders first randomly choose 106 movies and cast evaluations on them to build a reputation history. Thereafter, the colluders give the rating 10 to the last movie. The number 106 is calculated based on the mean of the distribution of votes per evaluator in the MovieLens dataset and the total number of movies. The mean of Beta(1.3247, 19.5008) is 0.0636 and the total number of movies is 1,682, so, on average, each evaluator evaluates 0.0636 × 1,682 ≈ 106 movies. The evaluations given by colluders to the chosen movies are normally distributed around the true value of the movie with the standard deviation of σ.

Based on the value of σ, we design two collusion scenarios. In the first scenario, σ is the maximum of all standard deviations assigned to evaluators. In other words, colluders behave similarly to the least accurate members of the community. In the second scenario, which is harder to detect, colluders vote like an average evaluator. More precisely, they vote with σ equal to the average of all standard deviations assigned to evaluators. In both scenarios, we change the size of the attack, i.e., Nc, from 10 % to 100 % of the number of evaluations cast on the last movie.
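
The sketch below, continuing the one above, shows one possible way to inject such an attack into a base dataset; the helper name and the colluder id scheme are hypothetical.

```python
import numpy as np

def inject_collusion(ratings, true_values, sigma, num_colluders, scenario="max", seed=1):
    """Add num_colluders fake evaluators that build a history on ~106 random movies
    and then all give the target (last) movie the maximum rating of 10."""
    rng = np.random.default_rng(seed)
    colluder_sigma = sigma.max() if scenario == "max" else sigma.mean()
    target = len(true_values) - 1
    first_id = 1 + max(r for r, _ in ratings)            # colluder ids follow the honest ones
    for c in range(num_colluders):
        cid = first_id + c
        history = rng.choice(target, size=106, replace=False)   # never the target movie
        for m in history:
            v = rng.normal(true_values[m], colluder_sigma)
            ratings[(cid, int(m))] = int(np.clip(round(v), 1, 10))
        ratings[(cid, target)] = 10                       # the boosting vote
    return ratings
```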

To decrease the impact of random generation of data on the results, we generate 100 different base datasets and then use them to generate datasets which include a collusion. Thus, the result of an experiment is, in fact, the mean of the results calculated for all 100 datasets. All experiments are conducted on a Windows 7, 64-bit machine with an Intel Core 2 Duo CPU and 4 GB of memory. All algorithms are implemented in MATLAB R2012b.

    7.1.2 The basic algorithm behavior analysis

The parameter α is responsible for tuning the behavior of our algorithm when calculating credibility levels. It defines the level of discrimination which our algorithm applies to marginalise people who deviate from the community sentiment. The larger the value of α, the more discriminative the algorithm is. On the other hand, q impacts the nature of the calculated ranks; as already mentioned, higher values of q result in convergence of the rating scores towards the levels with the maximum credibility.

In order to find suitable values for α and q, we have tested our algorithm on synthetic data with different values and measured the accuracy of our algorithm accordingly. Specifically, we change the value of α from 0.5 to 4 with a step of 0.5, and the value of q from 0.1 to 3.0 with a step of 0.1. The result is presented in Figure 6. As shown in this figure, the highest accuracy levels are obtained when the q value is around 1.5 and for 1.8 ≤ α ≤ 2.8. Thus, in our subsequent experiments we use the empirically obtained values q = 1.5 and α = 2.
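
A parameter sweep of this kind can be expressed as a simple grid search. The sketch below assumes a hypothetical evaluate_rms(dataset, alpha, q) helper that runs the rating algorithm with the given parameters and returns its RMS error against the ground truth; it is an illustration of the sweep, not our experimental code.

```python
import numpy as np

def grid_search(datasets, evaluate_rms):
    """Return the (alpha, q) pair with the smallest mean RMS error over the datasets."""
    alphas = np.arange(0.5, 4.01, 0.5)
    qs = np.round(np.arange(0.1, 3.01, 0.1), 1)
    errors = np.array([[np.mean([evaluate_rms(d, a, q) for d in datasets]) for q in qs]
                       for a in alphas])
    i, j = np.unravel_index(errors.argmin(), errors.shape)
    return alphas[i], qs[j], errors
```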


Figure 6 Analyzing the behavior of our basic algorithm [2] with different values of α and q

    7.1.3 The basic algorithm performance comparison

In this section, we compare the performance of our algorithm with two other state-of-the-art algorithms. The first was proposed by De Kerchove et al. in [11] and completed in [13], referred to as DeKerchove. In this algorithm, an affine function [13] is employed as the discrimination function. To the best of our knowledge, this is one of the most robust algorithms in the area. The second algorithm for comparison is one of the pioneering iterative algorithms, proposed by Laureti and his colleagues, referred to as Laureti [22]. More details about these related works are presented in Section 8.

We use the Root Mean Square (RMS) error as the performance comparison metric:

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - y_i)^2}{n}}$$

where n is the number of movies rated. We first apply each algorithm to a dataset and compute the movies' rating scores. We then find the root mean square of the deviation of these rating scores from the true values which were used to generate our synthetic dataset. The overall RMS error of an algorithm is the average of all errors obtained by applying the algorithm to all 100 datasets. Smaller RMS errors mean that the ranks were closer to the true values and consequently indicate a higher accuracy.
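
As a small illustration, the RMS error and its averaging over datasets can be computed as follows; run_algorithm and the dataset objects (with a true_values attribute) are placeholders for our experimental harness.

```python
import numpy as np

def rms_error(estimated, true):
    """Root-mean-square deviation of the estimated rating scores from the ground truth."""
    estimated, true = np.asarray(estimated, float), np.asarray(true, float)
    return np.sqrt(np.mean((estimated - true) ** 2))

def overall_rms(datasets, run_algorithm):
    """Average the per-dataset RMS error over all (e.g., 100) synthetic datasets."""
    return np.mean([rms_error(run_algorithm(d), d.true_values) for d in datasets])
```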

Figure 7 shows the results of applying our algorithm alongside the two related algorithms to the synthetic datasets explained earlier in Section 7.1.1. The illustration in Figure 7a shows the RMS errors of the algorithms when the error rate of colluders is the maximum of all error rates in the community. In other words, colluders behave similarly to the most erroneous evaluator in the community. The second illustration, i.e., Figure 7b, represents the RMS errors of all three algorithms when dealing with "smarter" colluders, who behave like an average person in the community (the standard deviation of the colluders is equal to the average standard deviation in the community). The charts show that our model outperforms both the DeKerchove and Laureti algorithms by generating smaller RMS errors in both experiments. It is notable that collusion detection becomes a real challenge for any rating system


Figure 7 Comparing the robustness of our algorithm with related work

when the number of malicious reviewers greatly exceeds the number of trustworthy reviewers.

    7.2 How helpfulness improves system’s behaviour

In this section we show how the helpfulness of reviews changes the results of the main algorithm, i.e., the calculated product rating scores as well as the degree of conformance computed for reviewers. We first use a simple synthetic dataset to show the expected behavior of the system in the presence of the degree of conformance and the helpfulness of reviews. We then employ the model we used in Section 7.1.1 to build synthetic datasets using the Amazon log and use them to evaluate the performance of our algorithm in the presence of helpfulness.

    7.2.1 A simple evaluation scenario

We start the evaluation of our algorithm with a very simple example, which nevertheless illustrates the heuristic which was the starting point for our algorithm. Table 2a shows the votes cast by 5 voters (r1, . . . , r5) on 5 items (I1, . . . , I5) in 5 different elections (ℓ1, . . . , ℓ5). For example, the first row shows that in the first election r1,


Table 2 A simple example showing the fundamental dynamics of our proposed model

(a) Dataset

Election \ Voter   r1   r2   r3   r4   r5
ℓ1                  1    1    1    2    2
ℓ2                  1    2    2    3    2
ℓ3                  3    4    4    4    2
ℓ4                  1    2    3    3    4
ℓ5                  1    2    2    1    1

(b) Results

Item \ Election    ℓ1    ℓ2    ℓ3    ℓ4    ℓ5
I1               0.98  0.03  0     0.05  0.20 (0.83)
I2               0.16  0.99  0.04  0.59  0.98 (0.55)
I3               0     0.11  0.03  0.80  0
I4               0     0     0.99  0.06  0
I5               0     0     0     0     0

r2 and r3 have voted for I1 and r4 and r5 have voted for I2. We first assume that the helpfulness of all votes is the same and equal to 1.0.

In a usual election model, I1 must win the last election because it has received 3 votes out of 5. However, when we look at the history of the voters' choices, r2 and r3, who have voted for I2 in the last election, have voted for the winners in almost all past elections: r3 has voted for the winner in 4 elections and r2 in 3. This means that they have always been behaving close to the community sentiment. On the other hand, r1, r4 and r5, who have voted for I1, do not conform to the community sentiment in most of the past elections. Therefore, we can argue that in the last election I2 should win, despite the fact that the majority have voted for I1. Table 2b shows the results of running our model on the data provided in Table 2a. The initial values of items I1 and I2 are also provided in brackets to show how the winner of the election has correctly changed.
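
The heuristic can be illustrated with a few lines of Python on the data of Table 2a: weighting each voter in the last election by the fraction of past elections in which she backed the plurality winner already flips the outcome from I1 to I2. This is only a crude stand-in for our iterative algorithm, intended to show the intuition.

```python
from collections import Counter

# Votes from Table 2a: election -> item chosen by (r1, ..., r5); items are numbered 1..5.
votes = {1: [1, 1, 1, 2, 2],
         2: [1, 2, 2, 3, 2],
         3: [3, 4, 4, 4, 2],
         4: [1, 2, 3, 3, 4],
         5: [1, 2, 2, 1, 1]}

# Plurality winner of each past election (elections 1-4).
winners = {e: Counter(v).most_common(1)[0][0] for e, v in votes.items() if e < 5}

# Crude conformance: fraction of past elections in which a voter backed the winner.
conformance = [sum(votes[e][r] == winners[e] for e in winners) / len(winners)
               for r in range(5)]

# Weighted tally of the last election: each vote counts with its voter's conformance.
tally = Counter()
for r, item in enumerate(votes[5]):
    tally[item] += conformance[r]
print(tally.most_common(1)[0][0])   # prints 2: I2 wins once history is taken into account
```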

This change in the result of the last election occurred because of the changes which happened in the conformance degrees of the users in the different iterations of the algorithm. At the beginning, every user has a conformance degree equal to 1.0. Then, based on the helpfulness of their votes and how they conform to the community consensus, their degrees of conformance change. Figure 8a shows how these changes happen.

Now, one may argue that people can vote and then ask their friends to come and vote on the helpfulness of their cast votes. In this way a user could build herself a high degree of conformance with the community. This should not be possible unless the user has cast votes that are helpful and close to the community consensus in all previous elections. To show this we choose the reviewer r1 and change the helpfulness of her votes from 0.0 to 1.0 step by step in all elections and check its impact on the degree of conformance of others as well as herself. As illustrated in Figure 8b, changing the helpfulness of the reviews of Reviewer1 does not create any major changes in the conformance degrees of other users. The interesting point is that with increasing helpfulness of the reviews of Reviewer1 her degree of conformance decreases slightly. This happens because the votes cast by Reviewer1 are not supported by others, so she gains little credit from her votes. On the other hand, since she has a high degree of helpfulness, the credit she gains from the community consensus (related to 1 − κrl) is also decreased. Therefore, our


Figure 8 How changing helpfulness impacts the conformance degrees of reviewers in a community

model guarantees that users cannot build themselves fake degrees of conformance to the community.

On the other hand, changes in the helpfulness of the reviews cast by users who strongly conform to the community should result in major changes in the conformance degrees of all users, because these users represent the consensus of the community and any change in the weight of their votes may change the sentiment of the community. The votes cast by Reviewer2 almost always fully conform to the community consensus, so she should have a high degree of conformance. To check


how the level of helpfulness of the reviews cast by Reviewer2 impacts others, we change her helpfulness from 0.0 to 1.0 and measure the impact of this change. The results of this measurement are illustrated in Figure 8c. At the beginning, because Reviewer2 has a low level of helpfulness, her votes have a small impact on the results of others and she is considered an outlier, but when we increase her level of helpfulness everything changes. Reviewer3, the other user with a high conformance level, gains more credit because of the similarity of her votes with the votes cast by Reviewer2. Also, the degrees of conformance of the users with different opinions, namely Reviewer4 and Reviewer5, sharply decrease. Therefore, in our model a user can build herself a high degree of conformance only by voting in accordance with the community sentiment and casting votes which are considered helpful by other community members.

    7.2.2 A comprehensive evaluation scenario

In this step of the evaluation, we use the log of the Amazon online market, referred to as AMZLog. This log contains more than 7 million ratings cast on the quality of more than 500 thousand products, collected in the summer of 2006. We use a subset of this dataset which contains reviews cast on products of type 'Music'. This subset of the data contains more than 80,000 reviews cast on about 18,000 products rated by more than 318,000 reviewers.

We use exactly the same method we used earlier in Section 7.1.1 to generate 10 datasets which statistically conform to AMZLog. We only reduce the number of products to 1,000 and the number of reviewers to 16,000. We also use the distribution of the level of helpfulness to assign each reviewer a level of helpfulness.

We use this set of synthetic datasets to check how the helpfulness of a reviewer can impact the accuracy of the results calculated by our algorithm. To do so, we first randomly choose a reviewer from the dataset who has a small error rate (error rate = 1) and has cast a relatively high number of reviews (compared to other reviewers). We call this reviewer a trustworthy reviewer. Similarly, we randomly choose a reviewer from the dataset with a large error rate (error rate = 5) and a large number of cast reviews and call her a malicious reviewer. For both the trustworthy and the malicious reviewer, we change the level of helpfulness of their cast reviews from 0 to 1 and measure the changes that occur in the rating scores of the products which they have evaluated. To compute the RMS error we compare the rating scores of the products calculated with the new level of helpfulness with their true values.
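
Schematically, such a sweep can be written as the following hypothetical helper; set_helpfulness, products_rated_by, true_values and run_algorithm are placeholder names for the corresponding pieces of our experimental harness, not a published API.

```python
import numpy as np

def helpfulness_sweep(dataset, reviewer, run_algorithm, steps=11):
    """Vary one reviewer's helpfulness from 0 to 1 and record the RMS error of the
    products she rated (all dataset accessors are assumed placeholders)."""
    errors = []
    for h in np.linspace(0.0, 1.0, steps):
        dataset.set_helpfulness(reviewer, h)
        scores = run_algorithm(dataset)
        rated = dataset.products_rated_by(reviewer)
        diffs = np.asarray([scores[p] - dataset.true_values[p] for p in rated])
        errors.append(np.sqrt(np.mean(diffs ** 2)))
    return errors
```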

We followed a method similar to the one used in Section 7.1.2 to analyze the behaviour of our main algorithm. The results of this analysis showed that the values α = 2 and m = 3 generate the best results.

Figure 9a shows that increasing the level of helpfulness of trustworthy users directly decreases the RMS error of the results. So, the best way for a trustworthy user to boost her level of trustworthiness is to post reviews which have the potential of being considered 'helpful' by the community. Therefore, the model encourages reviewers to be trustworthy and helpful. Figure 9b depicts that a malicious reviewer cannot gain reputation by changing the helpfulness of her reviews, e.g., by asking her friends to deem her reviews helpful. The chart clearly shows that when the level of helpfulness of reviews cast by a malicious reviewer comes close to 1, they create high RMS errors. So, the user's error-rate behavioral factor increases and she will be identified by the behavior analyzer. Therefore, malicious reviewers cannot gain trust by increasing the level of helpfulness of their cast reviews.


Figure 9 Impact of changing the helpfulness of a trustworthy and a malicious reviewer on the accuracy of the results

In Section 7.1 we demonstrated that our basic algorithm performs better than some of the well-known existing ranking systems. The results of this section demonstrate that the performance of the system increases by incorporating the helpfulness of reviews. Therefore, the performance of the algorithm in the presence of the level of helpfulness parameter will be better than that of the existing related work.

    7.3 Evaluating trust calculation method

In the previous sections we demonstrated the superiority of our algorithm over the existing related work. Our algorithm in the previous section employs the level of conformance along with the level of helpfulness to calculate the rating scores. In this section we show that substituting the level of conformance with the trust ranks calculated by our fuzzy trust calculation method increases the accuracy of the results and consequently improves the performance of the algorithm.

In order to show this improvement, we apply our main algorithm to the datasets generated in Section 7.2.2 and calculate the RMS error of the calculated results. Then we replace the level of conformance with the trust rank (see Section 6)


and recalculate the rating scores and measure the RMS error again. Based on our experiments, the RMS error of applying the main algorithm on the datasets is 0.55698 and the RMS error of using trust ranks rather than conformance levels is 0.53978. The results show that the accuracy of the calculated rating scores slightly increases when we substitute the level of conformance with the corresponding trust rank.

    8 Related work

As demonstrated in [42], as rating systems get more and more popular and more users rely on them to decide on purchases from online stores, the temptation to obtain fake rating scores for products or fake reputation scores for people has dramatically increased. To detect such reviews, Mukherjee et al. [33] propose a model for spotting fake review groups in online rating systems. The model analyzes textual feedback cast on products in the Amazon online market to find collusion groups. They employ the FIM algorithm [1] to identify candidate collusion groups and then use eight indicators to identify colluders. In [28] the authors assign every review a degree of spam value, and based on these values they identify the most suspicious users and investigate their behavior to find the most likely colluders. In [20] the authors try to identify fake reviews by looking for unusual patterns in posted reviews.

In a more general setup, collusion detection has been studied in P2P and reputation management systems; good surveys can be found in [7] and [39]. EigenTrust [21] is a well-known algorithm proposed to produce collusion-free reputation scores; however, the authors in [26] demonstrate that it is not robust against collusion. Another series of works [29, 43, 44] uses a set of signals and alarms to point to suspicious behavior. The most famous ranking algorithm of all, the PageRank algorithm [35], was also devised to prevent collusive groups from obtaining undeserved ranks for webpages.

Iterative methods for trust evaluation and ranking have been pioneered in [22, 45]. Some of the ideas from these papers, as the authors mention, were among the starting points of [11–13]; the proof techniques which we used in this paper were inspired by the techniques developed in [11].

The algorithm proposed in [22] by Laureti and his colleagues is a classic iterative algorithm which uses the distance of the approximate ranks being computed from the ratings provided by a user to calculate the trust rank of that user. The trust ranks so obtained are in turn used to weight this user's ratings when the new approximation of the ranks of items is obtained as a weighted average of the ratings of all users. While the authors provide a comprehensive statistical analysis of the behavior of their algorithm, its robustness against collusion is not well studied.

De Kerchove and Van Dooren have proposed a set of iterative voting systems for calculating rating scores, called "iterative filtering" [13]. A detailed proof is provided for the convergence of their algorithms. However, the performance of this work is not evaluated in the presence of massive intelligent collusion attacks.

The algorithm proposed by Zhou et al. [46] uses the Pearson correlation coefficient to find the correlation between reviewers and the true values, and assigns weights to their cast reviews based on this correlation. The algorithm shows good performance when applied to both synthetic and real-world data.


Six reputation-based rating algorithms have been proposed by Li et al. [25]. The main difference between these six algorithms is in the method by which they calculate the distance from the approximations of the rating scores, and the aggregation method which they use to build the trust ranks. The main shortcoming of this work is that its performance is not well studied in the presence of unfairness. Ayday and coauthors [4] have proposed another iterative algorithm for trust and reputation management. Unlike other related work, this algorithm blacklists outliers and eliminates their evaluations entirely.

Our present method sharply differs from all of these prior iterative methods by virtue of entirely decoupling the credibility assessment from the score aggregation. More precisely, the main idea used in [22, 45] as well as in [11–13] is to produce at each stage of iteration an approximation of the final ratings of the objects and then calculate for each rater the degree of her "belief divergence" from such calculated approximations, i.e., calculate some distance measure between her proposed ranks and these approximations of the final ranks of objects. In the subsequent round of iteration a new approximation of the ranks of all objects is obtained as a weighted average of the ranks proposed by raters, with the weight given to each rater's rank inversely related to her corresponding distance from the approximate final ranks obtained in the previous round of iteration. Thus, in this manner, the ranks of objects are produced simultaneously with an assessment of the trustworthiness of the raters as reflected in the weights given to their proposed ranks. Unfortunately, such algorithms suffer from a very serious stability problem: if during the initial stages of iteration the current approximation gets close to the scores of a particular rater, such algorithms converge rapidly to these scores and give a zero weight to all other raters. Such a situation can happen either accidentally or as a result of a collusion attack. In such an attack a group of colluding raters C1, . . . , Cn propose unrealistic evaluations, while another colluder C0 proposes evaluations which are the mean of the realistic marks, likely to be close to the average of the marks given by the other, non-colluding raters, and the skewed evaluations of the colluders. In such a scenario it is very likely that the initial approximation produced by the standard iterative filtering algorithms will be very close to the evaluations provided by C0 and will cause such algorithms to converge to the evaluations of C0.9

To overcome such a problem, the authors in [11–13] introduce a regularisation constant which ensures that the distance between an approximation of the final ratings and the evaluations cast by any of the raters cannot become arbitrarily small. However, if such a constant is sufficiently large to ensure stability and robustness of the algorithm, the final rating scores become very close to the simple mean of all cast ratings.
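
For readers unfamiliar with this family of methods, the following schematic shows the kind of distance-based reweighting discussed above, with a regularisation constant that keeps the weights bounded; it is our own simplified illustration of the idea, not the exact scheme of [11–13].

```python
import numpy as np

def regularised_filtering_step(ratings, weights, reg):
    """One round of distance-based iterative filtering (schematic).
    ratings: (num_raters, num_items) matrix; weights: current per-rater weights."""
    estimate = weights @ ratings / weights.sum()            # weighted average per item
    distance = np.mean((ratings - estimate) ** 2, axis=1)   # each rater's divergence
    new_weights = 1.0 / (distance + reg)                    # reg prevents weights blowing up
    return estimate, new_weights / new_weights.sum()
```

With reg = 0, a rater whose scores happen to coincide with an early estimate receives unbounded weight, which is exactly the instability described above; making reg large pushes the result towards the plain mean.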

In contrast, our iterative method operates only on the credibility assessment of raters and on the levels of the community approval of items, which are obtained without using the fact that the items voted on are rating levels. In fact, as is obvious from our algorithm, we never use any comparisons of the proposed rating levels or even any ordering of the rating levels. We only rely on the levels of concurrence of the opinions of raters. This not only makes our algorithm unconditionally stable and collusion robust but also allows us to subsequently independently choose how to use

9 This attack strategy is due to Mohsen Rezvani (private communication, unpublished, with permission).


the obtained estimates of the 'community sentiment' to produce the aggregate rating scores of products.

The second author and his past co-authors, unaware of the pioneering work in [11–13, 22, 45], have proposed in [19] a fixed-point algorithm for trust evaluation in online communities and subsequently in [8] an algorithm for aggregating assignment marks given by multiple assessors. This method was later applied to the aggregation of sensor readings in wireless sensor networks in the presence of sensor faults [6]. In [23] he also proposed the idea of applying an iterative procedure for vote aggregation; however, the proposed method had some serious shortcomings. In the notation of the present paper, denoting again the total number of voters by N and the total number of voting lists by L, the recursion for computing the trustworthiness Tr of a rater Vr proposed in [23] was given by

$$T_r^{(n+1)} = \frac{1}{L} \sum_{l,i:\, r\to l_i} \left( \frac{\sum_{m:\, m,r\to l_i} T_m^{(n)}}{\sum_{m=1}^{N} T_m^{(n)}} \right)^{\frac{p}{p+1}}.$$

Unfortunately, the normalizing factor in the denominator on the right-hand side can become excessive as the number of voters who did not vote in any elections in which Vr has voted increases, making the rank computation unstable. Also, the exponent p/(p + 1) is always smaller than 1, and this severely limits the robustness of the proposed method against collusion attacks. The algorithm aimed to relate (a power of) the ratios between the trustworthiness of any two voters to the ratios of the numbers of votes received by the candidates chosen by these voters. It also normalized the trustworthiness of voters instead of normalizing the credibility of levels; however, as we do it in our present algorithm, normalizing the credibility of levels, which are going to be used as weights in a subsequent (independent) computation of the ranks of objects, not only makes more sense but also allows an elegant proof of convergence, missing in [23].

In summary, unlike the existing models for collusion detection, we do NOT rely on any clustering techniques, local indicators or averaging; also, our method does not rely on any approximation of the final rating scores, making rating an entirely independent process from the credibility assessment. Moreover, we take into account the helpfulness of the posted reviews, which is missing in almost all previous work. Finally, using the results of the rating algorithm and analyzing the voting behavior of the people, we assign them trust ranks, which differentiates our research from other related work.

    9 Conclusions and future work

In this paper, we proposed an iterative method for robust evaluation of products and reviewers in social rating systems. We proposed a framework comprising three main components: computing products' rating scores, analyzing the behavior of reviewers, and computing a trust rank for every reviewer based on the result of this analysis. For calculating rating scores, we proposed an iterative algorithm which takes into account the concurrence of votes on the quality of a product as well as the helpfulness of the cast votes. Existing iterative methods, such as [11–13, 22, 45], approximate rating scores using techniques involving weighted averages. However, averages generally


have the propensity to blur statistical features because they smooth out data. In our method, the weight assigned to the votes of raters, i.e., the 'degree of conformance' of the rater, is computed purely from the concurrence of opinions, without any averaging at all. In fact, in our method, the ordering of the range of credibility levels (i.e., an increasing ordering from, e.g., 1 to 10) is NOT considered at all; we treat such a domain as an unordered set, and only consider the concurrence of opinions.

To evaluate reviewers, we compute the trust rank of a reviewer based on three different aspects of her behavior, such as her conformance to the community consensus and her error rate. We employ fuzzy techniques to deal with the level of uncertainty and imprecision which is part of the nature of behavioral factors like error rate and trust. The evaluation results demonstrate the robustness of our methods against unfair collusive attacks, and against massive amounts of collusive votes which can easily break down any rating system.

In our future work, we will further refine our method by taking into account more statistical features of the behavior of the reviewers as well as more information extracted from the cast votes, such as their distribution models, with the aim of producing a complete yet fully flexible evaluation framework which can be precisely tuned to produce rating scores and trust ranks reflecting any desired statistical feature of the evaluation data.

    Appendix: Proof of convergence

Let m ≥ 1 be a parameter; for a given community approval credibility degree ρ of items, referred to in the sequel simply as the credibility degree, let us define the corresponding conformance degree Tr(ρ) of a voter Vr as

$$T_r(\rho) = \sum_{l,i:\, r\to l_i} \rho_{li}\,\kappa_{rl} + \sum_{l:\, r\to l}\ \sum_{1\le j\le n_l} \frac{1-\kappa_{rl}}{m+1}\,\rho_{lj}^{m+1}$$

and denote by T(ρ) the vector of such conformance degrees, T(ρ) = ⟨T1(ρ), T2(ρ), . . . , TN(ρ)⟩. Let α ≥ 1 be a real parameter; then $\|T\|_{\alpha+1} = \left(\sum_r T_r^{\alpha+1}(\rho)\right)^{\frac{1}{\alpha+1}}$ is the (α + 1)-norm of T(ρ) ∈ R^N.

    We now wish to assign the credibility degrees to the items ranked so that:

1. for every list ℓ_l, the vector of credibility degrees of all items on that list is a unit vector, i.e., ‖(ρ)_l‖_2 = 1;

2. the (α + 1)-norm ‖T‖_{α+1} of the conformance vector T(ρ) is maximized.10

This, in a sense, gives "the benefit of the doubt" to all voters, giving them the largest possible joint conformance degrees (in terms of the norm of the conformance vector),

10 The reason for considering ‖T(ρ)‖_{α+1} rather than ‖T(ρ)‖_α will be clear below.


while maintaining for each list the same unit l2 norm of the vector corresponding to the ranks of all objects on that list. Let us define

$$\beta_{li}(\rho) = \sum_{r:\, r\to l_i} \kappa_{rl}\, T_r^{\alpha}(\rho) + \sum_{r:\, r\to l} (1-\kappa_{rl})\,\rho_{li}^{m}\, T_r^{\alpha}(\rho);$$

$$F(\rho) = \sum_{r} \left( \sum_{l,i:\, r\to l_i} \rho_{li}\,\kappa_{rl} + \sum_{l:\, r\to l}\ \sum_{1\le j\le n_l} \frac{1-\kappa_{rl}}{m+1}\,\rho_{lj}^{m+1} \right)^{\alpha+1}.$$

Note that F(ρ) = (‖T‖_{α+1})^{α+1} and our aim is equivalent to maximizing F(ρ), subject to the constraints

$$C = \left\{ \sum_{i} \rho_{li}^2 = 1,\ 1 \le l \le L \right\}.$$

For this purpose we introduce for each list ℓ_l a Lagrange multiplier λ_l, define λ = ⟨λ_l : 1 ≤ l ≤ L⟩ and look for the stationary points of the Lagrangian function

$$\Lambda(\rho, \lambda) = F(\rho) - \sum_{q} \lambda_q \left( -1 + \sum_{m} \rho_{qm}^2 \right).$$

A straightforward computation gives

$$\frac{\partial F(\rho)}{\partial \rho_{li}} = (\alpha+1)\sum_{r} \left( T_r^{\alpha}(\rho) \sum_{s,j:\, r\to s_j} \frac{\partial}{\partial \rho_{li}}\,\rho_{sj}\,\kappa_{rs} \right) + (\alpha+1)\sum_{r} \left( T_r^{\alpha}(\rho)\, \frac{1-\kappa_{rl}}{m+1}\, \frac{\partial}{\partial \rho_{li}} \sum_{1\le j\le n_l} \rho_{lj}^{m+1} \right)$$
$$= (\alpha+1)\left( \sum_{r:\, r\to l_i} \kappa_{rl}\, T_r^{\alpha}(\rho) + \sum_{r:\, r\to l} (1-\kappa_{rl})\, \rho_{li}^{m}\, T_r^{\alpha}(\rho) \right) = (\alpha+1)\,\beta_{li}(\rho).$$

Thus, we get

$$\frac{\partial \Lambda(\rho,\lambda)}{\partial \rho_{li}} = \frac{\partial F(\rho)}{\partial \rho_{li}} - 2\lambda_l \rho_{li} = (\alpha+1)\,\beta_{li}(\rho) - 2\lambda_l \rho_{li}. \quad (5)$$

If (ρ, λ) is a stationary point of Λ then ∂Λ(ρ,λ)/∂ρ_li = 0, in which case

$$\rho_{li}\,\lambda_l = \frac{(\alpha+1)\,\beta_{li}(\rho)}{2}. \quad (6)$$

By squaring (6) and summing over all indices i of objects on the list l we get

$$\lambda_l^2 \sum_{i=1}^{n_l} \rho_{li}^2 = \frac{(\alpha+1)^2}{4} \sum_{i=1}^{n_l} \beta_{li}^2(\rho).$$


Since (ρ, λ) is a stationary point of Λ, also ∂Λ(ρ,λ)/∂λ_l = 0; this implies Σ_{i=1}^{n_l} ρ²_li = 1, and since by (6) λ_l must be positive, we obtain from the above and from (6)

$$\rho_{li} = \frac{\beta_{li}(\rho)}{\sqrt{\sum_{m=1}^{n_l} \beta_{lm}^2(\rho)}}. \quad (7)$$

We now define ρ → (ρ)* to be the mapping such that for an arbitrary ρ,

$$(\rho)^*_{li} = \frac{\beta_{li}(\rho)}{\sqrt{\sum_{m=1}^{n_l} \beta_{lm}^2(\rho)}}. \quad (8)$$

Recall that we denote by (x)_l the projection of a vector x ∈ R^M onto the subspace of dimension n_l which corresponds to a list ℓ_l; then (8) can be written as

$$(\rho^*)_l = \frac{(\nabla F(\rho))_l}{\|(\nabla F(\rho))_l\|_2}. \quad (9)$$

Thus, (σ, λ) is a stationary point of Λ just in case σ* = σ, i.e.,

$$(\sigma)_l = \frac{(\nabla F(\sigma))_l}{\|(\nabla F(\sigma))_l\|_2}. \quad (10)$$

We now see that in our algorithm the approximation ρ^(n+1) of the vector ρ obtained at the stage of iteration (n + 1) is given by

$$\rho^{(n+1)} = (\rho^{(n)})^*$$

and that our algorithm will halt when ρ^(n) gets close to a fixed point σ = (σ)* of the operation ρ → (ρ)*, which is also a stationary point of the Lagrangian function Λ(ρ, λ). As we will see, such a point σ is a constrained local maximum of F(ρ), subject to the constraints ‖(ρ)_l‖ = 1, 1 ≤ l ≤ L. Note that we do not need to prove the uniqueness of such a fixed point because our final credibility degrees are defined as the outputs of our algorithm, and we only need to prove that our algorithm eventually terminates; for this purpose just approaching a fixed point is sufficient.
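
For illustration, one application of the mapping ρ → (ρ)* can be sketched in Python as follows, directly transcribing the definitions of T_r, β_li and (8); the data layout and the function name are our own assumptions, not a reference implementation.

```python
import numpy as np

def fixed_point_step(rho, votes, kappa, alpha=2.0, m=3):
    """One application of rho -> rho*.
    rho:   dict list -> float array of credibilities over the n_l levels (unit l2 norm)
    votes: dict list -> {rater: chosen level index}
    kappa: dict (rater, list) -> helpfulness weight in [0, 1]"""
    raters = {r for v in votes.values() for r in v}
    # Conformance degrees T_r(rho)
    T = {r: 0.0 for r in raters}
    for l, v in votes.items():
        level_term = np.sum(rho[l] ** (m + 1)) / (m + 1)
        for r, i in v.items():
            k = kappa[(r, l)]
            T[r] += rho[l][i] * k + (1 - k) * level_term
    # beta_li(rho) and the per-list normalisation of equation (8)
    new_rho = {}
    for l, v in votes.items():
        beta = np.zeros_like(rho[l])
        for r, i in v.items():
            k = kappa[(r, l)]
            beta[i] += k * T[r] ** alpha
            beta += (1 - k) * (rho[l] ** m) * T[r] ** alpha
        new_rho[l] = beta / np.linalg.norm(beta)
    return new_rho
```

Iterating this step until successive credibility vectors stop changing is exactly the halting condition ρ^(n+1) = (ρ^(n))* discussed above.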

Let ρ be an arbitrary vector such that ‖(ρ)_l‖_2 = 1 for all 1 ≤ l ≤ L, and let ρ* = (ρ)* and h = ρ* − ρ; then, by applying the Taylor formula with the remainder in the Lagrange form, we get that for some 0 < c < 1 and μ_c = cρ + (1 − c)ρ* we have

$$F(\rho^*) = F(\rho + h) = F(\rho) + \nabla F(\rho)\cdot h + \frac{1}{2}\sum_{l,m,i,j} \frac{\partial^2 F(\mu_c)}{\partial \rho_{li}\,\partial \rho_{mj}}\, h_{li}\, h_{mj}. \quad (11)$$

Since

$$(h)_l = (\rho^*)_l - (\rho)_l = \frac{(\nabla F(\rho))_l}{\|(\nabla F(\rho))_l\|_2} - (\rho)_l \quad (12)$$


we get

$$(\nabla F(\rho))_l \cdot (h)_l = (\nabla F(\rho))_l \cdot \left( \frac{(\nabla F(\rho))_l}{\|(\nabla F(\rho))_l\|_2} - (\rho)_l \right) = \|(\nabla F(\rho))_l\|_2 - (\nabla F(\rho))_l \cdot (\rho)_l$$
$$= \|(\nabla F(\rho))_l\|_2 - \|(\nabla F(\rho))_l\|_2\, (\rho^*)_l \cdot (\rho)_l = \|(\nabla F(\rho))_l\|_2\, \big(1 - (\rho^*)_l \cdot (\rho)_l\big).$$

Let θ_l be the angle between the unit vectors (ρ)_l and (ρ*)_l, i.e., such that cos θ_l = (ρ)_l · (ρ*)_l. Then,

$$\left\|\frac{(h)_l}{2}\right\|_2^2 = \left(\sin\frac{\theta_l}{2}\right)^2 = \frac{1 - \cos\theta_l}{2} = \frac{1 - (\rho)_l \cdot (\rho^*)_l}{2}.$$

Combining the last two formulas we get

$$(\nabla F(\rho))_l \cdot (h)_l = \|(\nabla F(\rho))_l\|_2\, \frac{\|(h)_l\|_2^2}{2}. \quad (13)$$

    Assu