
Anomaly Detection in Feedback-based Reputation Systems through Temporal and Correlation Analysis

Yuhong Liu and Yan (Lindsay) Sun
Department of Electrical and Computer Engineering
University of Rhode Island, Kingston, RI 02881
Emails: {yuhong, yansun}@ele.uri.edu

Abstract—As the value of reputation systems is widely recognized, the incentive to manipulate such systems is rapidly growing. We propose TAUCA, a scheme that identifies malicious users and recovers reputation scores from a novel angle: the combination of temporal analysis and user correlation analysis. Benefiting from the rich information in the time domain, TAUCA identifies the products under attack, the time when attacks occur, and the malicious users who insert dishonest ratings. TAUCA and two other representative schemes are tested against real user attack data collected through a cyber competition. TAUCA demonstrates significant advantages. It largely improves the detection rate and reduces the false alarm rate in the detection of malicious users. It also effectively reduces the bias in the recovered reputation scores.

I. Introduction

The Internet has created vast opportunities for interacting with strangers. The interactions can be fun, informative, and even profitable [1]. However, there is also risk involved. Will a seller at eBay ship the product in time? Is the advice from a self-proclaimed expert at Epinions.com trustworthy? Does a product at Amazon.com have high quality as described?

To address these problems, one of the most ancient mechanisms in the history of human society, word-of-mouth, is gaining new significance in cyber space, where it is called a reputation system [2]. A reputation system collects evidence about the properties of individual entities, analyzes and aggregates the evidence, and disseminates the aggregated results. Here, the entity can be a person (e.g. in eBay), a product (e.g. in Amazon product rating), or a piece of digital information (e.g. a video clip at YouTube). The aggregated results are called reputation scores. Most commercial systems collect user feedback as evidence. This type of system is referred to as a feedback-based reputation system.

The value of reputation systems has been proved [3], [4]. For example, a survey conducted by comScore Inc. and The Kelsey Group revealed that consumers were willing to pay at least 20% more for services receiving an "Excellent," or 5-star, rating than for the same services receiving a "Good," or 4-star, rating [5]. As a consequence, the incentive to manipulate and mislead reputation systems is rapidly growing. There is ample evidence showing that firms post biased ratings and reviews to praise their own products. There are even scammers making a profit by writing sophisticated programs to automatically insert feedback [6].

In the current literature, most defense schemes [7]–[10] detect dishonest ratings by examining whether some rating values are far away from the majority's opinion. This philosophy works fine when the attackers insert a small number of dishonest ratings that are very different from honest users' rating values. However, it may (1) generate misleading results when the number of dishonest ratings is large and (2) yield a high false alarm rate when normal ratings have a large variance and dishonest ratings are not too far away from the majority's opinion.

(Acknowledgment footnote: We sincerely thank Qinyuan Feng for designing and administering the Cyber Competition. This research is partially supported by NSF award #0643532 and China Scholarship Council (CSC) #3020.)

Besides dishonest rating detection, there are defense schemes protecting reputation systems from other angles, such as evaluating trust in raters [11], providing incentives to honest users [9], and increasing the cost of acquiring multiple user IDs [12]. However, there are still limitations that are not fully addressed.

• Time domain, which contains rich information about honest user behavior and attacks, is not fully exploited. The current approaches address the time factors in two ways. In the first way, all ratings are treated equally and the time when ratings are provided is ignored. In the second way, recent ratings are given larger weight when computing the reputation scores. These simple approaches neglect the great potential of investigating time-domain information.

• To evaluate a reputation system, researchers need data representing honest users and data representing malicious attacks. The former can be obtained from real systems, such as Amazon.com. However, it is extremely difficult to obtain attack data from real systems, mainly because there is no ground truth indicating whether particular ratings are from attackers or not. Real human users can create multifaceted, coordinated, and sophisticated attacks that are not well understood yet. Thus, the lack of realistic attack data can hurt performance evaluation.

• The correlation among attackers' behavior, which can provide valuable side information for malicious user detection, has not been fully investigated yet.

In this paper, we propose an anomaly detection scheme, named TAUCA, for feedback-based reputation systems. Here, TAUCA is the abbreviation for joint Temporal And User Correlation Analysis. In particular, TAUCA employs a change detector, an intrinsically suitable tool for temporal analysis, which sensitively describes changing trends in the time domain. In the change detector, small shifts in the probabilistic model are not ignored but accumulated until they exceed a certain range. TAUCA detects suspicious time intervals in which attacks are highly likely to be present. Furthermore, TAUCA reduces the false alarm rate by analyzing correlation among suspicious users. Finally, the performance of TAUCA and two other schemes is evaluated based on real user attack data collected through a cyber competition. The design and performance evaluation in this paper address all three limitations mentioned above. Compared with the representative existing schemes in [7] and [8], TAUCA demonstrates significant advantages in terms of improving the detection rate and reducing the false alarm rate in the detection of malicious users.

The rest of the paper is organized as follows. Section II discusses the related work, Section III describes the system and attack models, Section IV introduces the details of TAUCA, and Section V presents the experiment results, followed by the conclusion in Section VI.

II. Related Work

The existing defense mechanisms protect feedback-based reputation systems from several angles.

1. Increasing the cost of acquiring multiple user IDs by binding identity with IP address [13], using network coordinates to detect sybil attacks [12], and analyzing trust relationships in social networks to identify collusion groups [14].

2. Discouraging false feedback by providing incentives to users who give useful feedback [4], [9].

3. Detecting dishonest feedback through statistical techniques [10], [15], [16].

4. Mitigating the effects of false feedback. Based on dishonest feedback detection, the system can build trust in raters [10], [11] and utilize the trust information to reduce the impact of malicious raters.

We elaborate on representative work in category 3, to which the proposed scheme belongs. In [7], a user is considered a malicious user if the calculated reputation score of an object (rated by this user) lies outside the q (lower) or (1 − q) (upper) quantile of this user's underlying rating distribution. The entropy-based approach, proposed in [10], identifies the ratings that cause a significant change in the uncertainty of the rating distribution as dishonest ratings. The iterative refinement approach, proposed in [8], estimates the quality of each object as the weighted average of all ratings to this object. The weight assigned to a user is proportional to his/her judging power, which is the inverse of the rating variance. In [9], dishonest rating analysis is conducted based on a Bayesian model. In [14], dishonest ratings are determined by clustering techniques.

As discussed in Section I, most of the above methods depend on the difference between the dishonest ratings and the majority's opinion. They have some inherent limitations. The proposed approach detects dishonest ratings and abnormal raters from a different angle: analyzing time-domain information and correlation among the behaviors of raters*. The proposed method is compatible with many defense schemes in categories 1, 2, and 4.

III. System Model and Attack Model

System Model: In this paper, we model feedback-based reputation systems as a system in which users provide ratings to objects. This model can describe many practical systems. For example, buyers provide ratings to products on Amazon.com, and members rate local shops on Epinions.com. The objects in the above systems are products and shops, respectively.

The ratings in our system take integer values from 1 to 5. In reality, one user may provide several ratings for the same object, and we only count one. In addition, many users provide ratings and reviews at the same time. Note that the reviews may be untruthful. In this work, we only consider ratings. The analysis of the reviews and the TAUCA scheme complement each other.
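To fix notation for the sketches that follow, here is a minimal rating record and the keep-one-rating-per-(user, object) rule described above. The data structure is our own illustration; the paper does not prescribe one, and we keep the earliest copy since the paper does not specify which duplicate is counted.

from collections import namedtuple

# One rating record: who rated what, when, and with which integer value (1-5).
Rating = namedtuple("Rating", ["user_id", "object_id", "day", "value"])

def deduplicate(ratings):
    # Keep a single rating per (user, object) pair, as stated above.
    # Assumption (ours): the earliest rating is the one that counts.
    seen, kept = set(), []
    for r in sorted(ratings, key=lambda r: r.day):
        if (r.user_id, r.object_id) not in seen:
            seen.add((r.user_id, r.object_id))
            kept.append(r)
    return kept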

Attack Model: In the attack model, an attacker may control one or several user IDs, and each of these user IDs is referred to as a malicious user. The malicious users controlled by the same attacker collude with each other and try to downgrade or boost the reputation scores of objects. All ratings provided by malicious users are considered dishonest ratings. An attack profile describes the behavior of all malicious users controlled by one attacker. If an object is under attack, the attack direction describes whether the attacker tries to downgrade or boost the reputation score of this object.

*Note that the time-domain information was exploited in our previous work in [17]. The previous approach used totally different detectors and did not have correlation analysis.

Defense Goal: To protect feedback-based reputation systems, the defense solution should be able to (a) obtain accurate object reputation even when the object is under attack, (b) detect malicious users, and (c) remove dishonest ratings.

Assumption: In this work, we assume that the objects have intrinsic quality, which does not change rapidly. This assumption is true when the objects are products, hotels, and shops, but may not be true when the objects are movies, whose quality is difficult to judge.

IV. Joint Temporal and User Correlation Analysis (TAUCA)

A. Overview

The proposed TAUCA scheme contains three main components: (a) change detection, (b) user correlation calculation, and (c) malicious user group identification.

In many practical reputation systems, the objects have intrinsic and stable quality, which should be reflected in the distribution of normal ratings. If there are rapid changes in the rating values, such changes can serve as indicators of anomaly. Therefore, we design a change detector in TAUCA, which takes raw rating values as inputs and detects changes that either occur rapidly or accumulate over time. If the change detector is triggered by an object, this object is marked as under attack. The direction of the change, either boosting or downgrading, is called the attack direction. The time intervals in which the changes occur are called suspicious intervals.

However, the suspicious intervals can contain many normal ratings, for two reasons. First, it is difficult to identify normal ratings surrounded by dishonest ratings. Second, normal users may provide rating values that are too high or too low due to their unusual experiences or rating habits. Therefore, we introduce the user correlation analysis module. Previous work, such as [18] and [19], shows that there may exist large correlation among colluding malicious users. Thus, we calculate the similarity level (i.e. distance) between users who rate in the suspicious intervals. Based on such distances, users are separated into different clusters.

Finally, the malicious user group identification module identifies which cluster is composed of colluding malicious users. The ratings from detected malicious users are filtered out, and the remaining ratings are used to calculate the object reputation.

Figure 1 demonstrates the structure of TAUCA. The design details will be discussed in the rest of this section.

Fig. 1: System architecture

B. Temporal Analysis - Change Detector

Change detectors, such as Shewhart, finite weighted moving average, and CUSUM [20], are traditionally used in quality control, intrusion detection, spam filtering, and medical systems to identify changes in the probability distribution of a stochastic random process.

In reputation systems, since the normal ratings do not necessarily follow a specific distribution and the attacker may insert dishonest ratings with a small bias, we need to choose a change detector that is able to reliably detect small shifts and is insensitive to the probability distribution of the data.

CUSUM detectors, first proposed in [20], fulfill these requirements. In addition, compared with other change detectors, the detection delay of CUSUM is relatively small for a given false alarm rate [21]. Therefore, CUSUM is chosen as the base on which to build our detector.

B.1 Basic CUSUM

We first introduce the basic CUSUM detector, which determines whether a parameter θ in a probability density function (PDF) has changed; that is, it decides between two hypotheses: H0 : θ = θ0 and H1 : θ = θ1. Let p_{θ0} and p_{θ1} denote the PDF before and after the change, respectively. Let y_k denote the k-th sample of the data sequence (i.e. the rating sequence). The basic CUSUM decision function is

g_k = \max\left( g_{k-1} + \ln \frac{p_{\theta_1}(y_k)}{p_{\theta_0}(y_k)},\ 0 \right),   (1)

t_a = \min \{ k : g_k \ge h \},   (2)

where h is the threshold. Here, t_a is called the stopping time, the time when the detector identifies a change and raises an alarm. Each time g_k ≤ 0 or g_k ≥ h, the CUSUM detector restarts by setting g_k = 0, and a new round of detection begins.

When p_{θ0} is a Gaussian process with mean µ0, p_{θ1} is a Gaussian process with mean µ1, and both have variance σ², equation (1) detects the mean change and becomes

g_k = \max\left( g_{k-1} + \left( y_k - \mu_0 - \frac{\mu_1 - \mu_0}{2} \right),\ 0 \right).   (3)

Even if the distributions are not Gaussian, the above detector is still sensitive to mean changes [21].

B.2 Revised CUSUM

Basic CUSUM cannot be directly applied to our problem, for two reasons. First, we need to detect the starting time and the ending time of the change. In the basic CUSUM, the stopping time (t_a) is determined once the detection function (g_k) exceeds the threshold. At t_a, an alarm is issued, the starting time of the change (denoted by t_s) can be estimated, and the CUSUM detector restarts. However, after t_a, the change may last for some time, and we need to derive the ending time of the change from the successive g_k. Therefore, CUSUM should not restart at t_a. Second, many application scenarios have strict requirements on the detection delay, defined as the difference between the starting time t_s and the stopping time t_a. As a consequence, the threshold (h) in the basic CUSUM is selected based on the trade-off between detection delay and false alarm rate. In reputation systems, however, we do not have a strict requirement on detection delay, and the selection of the threshold should rely on the trade-off between detection rate and false alarm rate. Therefore, we need to revise the basic CUSUM detector. In the revised CUSUM, we define the off-time, denoted by t_b, as the time when g_k falls below the threshold h. That is,

t_b = \min \{ k : g_k < h \text{ and } k \ge t_a \}.   (4)

The revised CUSUM can detect multiple change intervals. For a specific change interval, let t_s denote the starting time of the change and t_e denote the ending time of the change. We also define t_1 as the counting-start-time. If there is no previously detected change, set t_1 = 0. If there is a previously detected change whose off-time is t_b^{pre}, we set t_1 = t_b^{pre}.

From the detection curve (g_k), we can obtain t_a and t_b. The next task is to estimate t_s and t_e, which define a specific change interval.

• Starting time of the change interval (t_s)

Assume that the data sequence follows distribution p_{θ0} before t_s, and p_{θ1} after t_s. Let c denote the index of data samples. Then the maximum likelihood estimator (MLE) of t_s is derived as

\hat{t}_s = \arg\max_{t_1 \le c \le t_a} \ln\left[ \prod_{i=t_1}^{c-1} p_{\theta_0}(y_i) \prod_{i=c}^{t_a} p_{\theta_1}(y_i) \right]
          = \arg\max_{t_1 \le c \le t_a} \left[ \ln \prod_{i=t_1}^{t_a} p_{\theta_0}(y_i) + \ln \frac{\prod_{i=c}^{t_a} p_{\theta_1}(y_i)}{\prod_{i=c}^{t_a} p_{\theta_0}(y_i)} \right]   (5)
          = \arg\min_{t_1 \le c \le t_a} \sum_{i=t_1}^{c-1} \ln \frac{p_{\theta_1}(y_i)}{p_{\theta_0}(y_i)}.   (6)

Notice that (6) is derived from (5) using the fact that \ln \prod_{i=t_1}^{t_a} p_{\theta_0}(y_i) is a fixed term. The estimated starting time \hat{t}_s is between t_1 and t_a.

• Ending time of the change interval (t_e)

Next, we estimate the ending time t_e using the MLE. Here, we assume that the data sequence follows distribution p_{θ0} between t_1 and t_s, follows p_{θ1} between t_s and t_e, and follows p_{θ0} between t_e and t_b. Let h denote the index of data samples.

\hat{t}_e = \arg\max_{\hat{t}_s \le h \le t_b} \ln\left[ \prod_{i=t_1}^{\hat{t}_s - 1} p_{\theta_0}(y_i) \prod_{i=\hat{t}_s}^{h} p_{\theta_1}(y_i) \prod_{i=h+1}^{t_b} p_{\theta_0}(y_i) \right]   (7)
          = \arg\max_{\hat{t}_s \le h \le t_b} \left[ \ln \prod_{i=\hat{t}_s}^{t_b} p_{\theta_1}(y_i) + \ln \frac{\prod_{i=h+1}^{t_b} p_{\theta_0}(y_i)}{\prod_{i=h+1}^{t_b} p_{\theta_1}(y_i)} \right]   (8)
          = \arg\max_{\hat{t}_s \le h \le t_b} \sum_{i=h+1}^{t_b} \ln \frac{p_{\theta_0}(y_i)}{p_{\theta_1}(y_i)}.   (9)

We use the fact that \ln \prod_{i=t_1}^{\hat{t}_s - 1} p_{\theta_0}(y_i) is a fixed term to derive (8) from (7), and the fact that \ln \prod_{i=\hat{t}_s}^{t_b} p_{\theta_1}(y_i) is a fixed term to derive (9) from (8). Figure 2 illustrates the relationship between t_1, t_a, t_b, t_s and t_e.

Fig. 2: Illustration of the revised CUSUM

Figure 3 illustrates the performance of the basic CUSUM and the revised CUSUM. The x-axis is the index of ratings. The upper plot shows the original rating sequence, ordered according to the time when the ratings are provided. The y-axis is the rating value, ranging from 1 to 5. The honest ratings are in blue, whereas the malicious ratings are in red. The lower plot shows the detection curves (g_k) of the basic CUSUM and the revised CUSUM, as well as the detection threshold. The suspicious intervals are defined by \hat{t}_s and \hat{t}_e. Two observations are made.

• Compared with the basic CUSUM, the revised CUSUM is more sensitive to dishonest ratings, especially when these ratings are not far away from honest ratings.

• The suspicious intervals of the revised CUSUM inevitably contain more honest ratings.

In other words, the revised CUSUM enlarges the suspicious intervals such that almost all dishonest ratings are included, together with some honest ratings.

Recall that the change detection in TAUCA provides inputs to the user correlation analysis. If some normal users are involved in the suspicious intervals, they can be separated from malicious users later. However, if some malicious ratings/users are not included in the suspicious intervals, we cannot detect them later. Therefore, the revised CUSUM is clearly more suitable for TAUCA.

B.3 Extension to 2-Sided CUSUM

The discussion in Section IV.B.2 only considers the change in one direction. Since malicious users could either boost or downgrade objects' reputation and cause changes in two directions, we need to employ 2-sided CUSUM [20].

Assume that the mean value before the change is µ0, and the mean value after the change is either µ_1^+ = µ0 + ν or µ_1^- = µ0 − ν. According to (3), we define

g_k^+ = \max\left( g_{k-1}^+ + y_k - \mu_0 - \nu/2,\ 0 \right)   (10)

g_k^- = \max\left( g_{k-1}^- - y_k + \mu_0 - \nu/2,\ 0 \right)   (11)

In our problem, the value of ν is unknown and is treated as a parameter that describes the sensitivity of the detection. The smaller the ν value, the more sensitive the change detector.

Here, g_k^+ is the detection function for boosting and g_k^- is the detection function for downgrading. In our attack model, an object is under either a boosting or a downgrading attack. If both g_k^+ and g_k^- exceed the threshold at the same time, the one with the larger value determines the attack direction.

Procedure 1 summarizes the 2-sided revised CUSUM detector. Lines 6-17 determine whether an object is under attack and whether the attack is boosting or downgrading. After the attack direction is determined, we assign g_k = g_k^+ if the attack is boosting and g_k = g_k^- if it is downgrading. Then, (2), (4), (6) and (9) are used to determine the change intervals, as presented in Lines 19-31 of Procedure 1.

In summary, the 2-sided revised CUSUM detector is used to detect (a) whether an object is under attack, (b) the attack direction, and (c) the time intervals (i.e. suspicious intervals or change intervals) in which the attacks are present.

Fig. 3: Demonstration of CUSUM detection curve (SI: Suspicious Interval; rCUSUM: Revised CUSUM; bCUSUM: Basic CUSUM)

Procedure 1 Change detection procedure
1: Under_attack = []  // the set containing objects under attack
2: Attack_direction[] = 'None'  // the attack direction for each object
3: for each object Oi do
4:   collect all ratings for Oi and order them according to the time that they are provided
5:   Alarm = 0  // a flag, which is 1 when gk exceeds the threshold
6:   compute g+_i = {g+_i(1), g+_i(2), ..., g+_i(N)} and g−_i = {g−_i(1), g−_i(2), ..., g−_i(N)}
7:   // under attack or not
8:   if (max(g+_i) > Threshold) || (max(g−_i) > Threshold) then
9:     add Oi to Under_attack[]
10:    // get attack direction
11:    if max(g+_i) > max(g−_i) then
12:      Attack_direction[i] = 'boosting'
13:      g_i = g+_i
14:    else
15:      Attack_direction[i] = 'downgrading'
16:      g_i = g−_i
17:    end if
18:    // get change starting time and ending time
19:    for each rating k do
20:      if (Alarm == 0) then
21:        if (g_i(k) > Threshold) then
22:          estimate change starting time ts
23:          Alarm = 1
24:        end if
25:      else
26:        if (g_i(k) < Threshold) then
27:          estimate change ending time te
28:          Alarm = 0
29:        end if
30:      end if
31:    end for
32:  end if
33: end for
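Below is a self-contained Python sketch of Procedure 1 for a single object, built on (10) and (11). For brevity it takes the change boundaries directly from the threshold crossings of g_k, with comments marking where the paper instead applies the MLEs (6) and (9); all names are ours.

def two_sided_cusum(ratings, mu0, nu):
    # g+ and g- of equations (10) and (11); mu0 is the object's normal-rating mean,
    # nu is the sensitivity parameter (smaller nu = more sensitive).
    gp, gm = [], []
    p = m = 0.0
    for y in ratings:
        p = max(p + y - mu0 - nu / 2.0, 0.0)
        m = max(m - y + mu0 - nu / 2.0, 0.0)
        gp.append(p)
        gm.append(m)
    return gp, gm

def detect_object(ratings, mu0, nu, threshold):
    # Lines 6-17 of Procedure 1: attack flag and attack direction.
    gp, gm = two_sided_cusum(ratings, mu0, nu)
    if max(gp) <= threshold and max(gm) <= threshold:
        return False, 'None', []  # not under attack
    direction = 'boosting' if max(gp) > max(gm) else 'downgrading'
    g = gp if direction == 'boosting' else gm
    # Lines 19-31: scan g for alarm on/off to bound each change interval.
    intervals, alarm, ts = [], False, None
    for k, gk in enumerate(g):
        if not alarm and gk > threshold:
            ts, alarm = k, True  # the paper refines ts with the MLE (6)
        elif alarm and gk < threshold:
            intervals.append((ts, k))  # the paper refines te with the MLE (9)
            alarm = False
    if alarm:
        intervals.append((ts, len(g) - 1))
    return True, direction, intervals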

C. User Correlation and Malicious User Detection

As discussed earlier, the suspicious intervals contain both honest and dishonest ratings. Therefore, further processing is needed to accurately determine the dishonest ratings. We define suspicious users as the users who provide ratings in the suspicious intervals. For two suspicious users X and Y, we collect all of their ratings, not just those for the objects under attack. Assume that they have rated n objects in common, and that their ratings for these n objects are denoted by {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n}, respectively. Then, the distance between X and Y is calculated as

d(X, Y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }.   (12)

This Euclidean distance is calculated for each pair of suspicious users. Then, the suspicious users are divided into several clusters according to their distances, using the K-means [22] method. The suspicious users with smaller distances among them are grouped together.

For each suspicious user group, we calculate the group rating as the average rating score of the users in this group for the object under attack. Recall that the attack direction has been determined by the change detector. When the attack direction is boosting, the suspicious user group with the largest group rating is marked as the malicious user group, and all users in this group are marked as malicious users. Similarly, when the attack direction is downgrading, the malicious user group is the group with the smallest group rating.

In summary, after the change detection, TAUCA moves from the time domain to the user domain. Through correlation analysis, malicious users are identified.
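Here is a sketch of the whole user-domain step described above: pairwise distances per (12), a 2-way grouping, and selection of the malicious group by its group rating. The paper clusters with K-means [22]; since this sketch works on a precomputed distance matrix, we substitute a tiny 2-medoids loop as a stand-in. All identifiers are ours, and the code assumes (as the paper's construction guarantees) that every suspicious user has rated the attacked object.

import numpy as np

def distance(rx, ry):
    # Equation (12) over the objects the two users rated in common.
    # rx, ry: dicts mapping object_id -> rating value.
    common = set(rx) & set(ry)
    return float(np.sqrt(sum((rx[o] - ry[o]) ** 2 for o in common)))

def two_clusters(D, iters=20):
    # Tiny 2-medoids clustering on a distance matrix (stand-in for K-means).
    # Assumes the suspicious users are not all identical.
    m = list(np.unravel_index(np.argmax(D), D.shape))  # start with the farthest pair
    for _ in range(iters):
        labels = np.argmin(D[:, m], axis=1)  # assign each user to the nearer medoid
        new_m = []
        for c in (0, 1):
            idx = np.where(labels == c)[0]
            # New medoid: the member minimizing total distance within its cluster.
            new_m.append(int(idx[np.argmin(D[np.ix_(idx, idx)].sum(axis=1))]))
        if new_m == m:
            break
        m = new_m
    return labels

def malicious_group(users, profiles, target, direction):
    # users: list of user ids; profiles: list of dicts object_id -> rating (same order).
    n = len(users)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = distance(profiles[i], profiles[j])
    labels = two_clusters(D)
    # Group rating = average rating of the attacked object within each cluster.
    group_rating = [np.mean([profiles[i][target] for i in range(n) if labels[i] == c])
                    for c in (0, 1)]
    bad = int(np.argmax(group_rating)) if direction == 'boosting' else int(np.argmin(group_rating))
    return [users[i] for i in range(n) if labels[i] == bad]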

D. Malicious Rating Filtering and Rating Aggregation

Finally, all of the ratings from the detected malicious users are removed. After removing such ratings, our method is compatible with any existing reputation scheme that calculates the object quality reputation. Without loss of generality, we use simple averaging in this paper for computing the object quality reputation.
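As a few illustrative lines (names ours), the filtering-and-averaging step:

def recovered_reputation(ratings, malicious_users):
    # ratings: iterable of (user_id, value) pairs for one object.
    kept = [v for u, v in ratings if u not in malicious_users]
    return sum(kept) / len(kept) if kept else None  # simple averaging, as above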

V. Experiment Results and Discussion

A. Experiment Setup

A.1 Cyber Competition

For any online reputation system, it is very difficult to evaluate its attack-resistance properties in practical settings due to the lack of realistic attack data. Even if one can obtain data from e-commerce websites, there is usually no ground truth about which ratings are dishonest. To understand human users' attack behaviors and evaluate TAUCA against non-simulated attacks, we designed and launched a Cyber Competition [23], which ran from 05/12/2008 to 05/29/2008. In this competition,

• We collected real online rating data for 300 products from a famous e-commerce website [24] in China. Such data contains 300 users' ratings to these products from day 1 to day 150.

• Each player can control at most 30 malicious user IDs. The total number of ratings from malicious user IDs should be less than 100. The players' goal is to downgrade the reputation score of product (i.e. object) O1.

• The system calculates the reputation score of product O1 every 15 days. In other words, 10 reputation scores are calculated on day 15, day 30, ..., and day 150. The average of these 10 reputation scores is the overall reputation of O1.

• The competition rule encourages the players to adjust the resources they use (i.e. the number of malicious user IDs). Each player can make many submissions. The effectiveness of each attack submission is measured by the bias introduced by the malicious users in the reputation of O1. The performance of each player is measured by the effectiveness of his/her submissions compared to other players. The detailed description of the evaluation criteria can be found at [23]. The competition attracted 630 registered players from 70 universities in China and the United States. We collected 826,980 valid submissions.

• Each submission, also called an attack profile, uses a specific number of malicious users (denoted by Na), and Na ranges from 1 to 30. Each malicious user can rate any of those 300 objects, and cannot rate one object more than once.

Fig. 4: Attack data distribution (mean offset vs. variance of the dishonest ratings for O1; six panels for Na = 5, 10, 15, 20, 25, and 30, containing 28,516; 18,799; 15,399; 13,964; 13,210; and 17,727 profiles, respectively)

Figure 4 demonstrates the diversity of the attack data. For each attack profile, we compute the mean offset (MO) as the mean of the dishonest ratings for O1 minus the true quality of O1 (i.e. the mean of the honest ratings). In Figure 4, the x-axis is the variance of the dishonest ratings for O1, the y-axis is the MO, and each dot represents an attack profile. It shows the distribution of variance and MO values for attack profiles when Na = 5, 10, 15, 20, 25, and 30. It is clear that our data contain very diverse attack behaviors.

In addition, even if two attacks have the same MO and variance, the specific attack behaviors can be very different. Notice that since rating values are integers, for any given Na, there are only finitely many combinations of MO and variance.

A.2 Test Data Preparation

Since we have collected a large amount of attack data, it is important to classify it and construct representative data sets for testing purposes. This is done in three steps (a small sketch follows the list).

1. We group attack profiles according to the number of malicious user IDs used in each profile (i.e. Na), and then select 6 groups with Na = 5, 10, 15, 20, 25, and 30, respectively.

2. We filter out inefficient attacks whose MO value is above 0 or in [-0.5, 0]. The attacks with MO value above 0 are from players who misunderstood the rule. Recall that the attack goal is to downgrade the reputation of O1. The attacks with MO value in [-0.5, 0] have trivial impact on the reputation of O1 because the dishonest ratings are too close to the true quality of O1.

3. Attack profiles are further divided into 5 levels according to their MO values: level index l = 1 means MO value in [-1, -0.5); l = 2 means MO value in [-1.5, -1); l = 3 means MO value in [-2, -1.5); l = 4 means MO value in [-2.5, -2); and l = 5 means MO value in [-3, -2.5).
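A minimal sketch of steps 2 and 3 (the function names and data layout are ours):

from collections import defaultdict

def mo_level(mo):
    # Returns the level index l for a mean offset, or None if filtered out.
    if mo >= -0.5:            # MO above 0, or in [-0.5, 0]: inefficient, discarded (step 2)
        return None
    if mo >= -1.0: return 1   # [-1.0, -0.5)
    if mo >= -1.5: return 2   # [-1.5, -1.0)
    if mo >= -2.0: return 3   # [-2.0, -1.5)
    if mo >= -2.5: return 4   # [-2.5, -2.0)
    if mo >= -3.0: return 5   # [-3.0, -2.5)
    return None               # MO below -3 cannot occur for 1-5 ratings here

def build_subsets(profiles):
    # profiles: iterable of (Na, MO) pairs; returns {(Na, l): [profiles]} per step 3.
    subsets = defaultdict(list)
    for na, mo in profiles:
        l = mo_level(mo)
        if na in (5, 10, 15, 20, 25, 30) and l is not None:
            subsets[(na, l)].append((na, mo))
    return subsets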

Let $A^{I}_{set,N_a,l}$ denote the subset of attack profiles using Na malicious users and having MO value at level l. Since Na = 5, 10, 15, 20, 25, or 30 and l = 1, 2, 3, 4, or 5, there are in total 30 subsets. All of these subsets are collectively referred to as $A^{I}_{set}$, which contains 103,054 attack profiles.

Additionally, we construct another data set that contains only strong attacks. When there are no defense mechanisms, the MO levels can roughly describe the strength of attacks. To address the attack strength when there are defense mechanisms, we define attack power (AP) as X − Y, where Y is the true reputation of O1 and X is the calculated reputation after using the defense scheme in our Cyber Competition. Briefly speaking, this defense scheme, described in [23], is a combination of dishonest rating detection and rater trust evaluation. It is important to point out that AP is different from MO. An attack with a large MO may be easily detected and thus have small impact on the final reputation scores.

Fig. 5: Performance of suspicious object detection in TAUCA. (For each Na, the test data set is $\bigcup_{1 \le l \le 5} A^{I}_{set,N_a,l}$.)

We construct the attack data subset $A^{II}_{set,N_a}$ as follows. Among all attacks using the same number of malicious users (i.e. Na), we first pick the attack with the largest AP value, and then pick the other attacks whose AP values are greater than 80 percent of this largest AP. All of the picked attacks are put into $A^{II}_{set,N_a}$. We construct $A^{II}_{set,N_a}$ for Na = 5, 10, 15, 20, 25, and 30. These six data sets are collectively referred to as $A^{II}_{set}$, which contains 18,175 attack profiles.

We would like to point out that the defense scheme in our cyber competition uses the philosophy that dishonest ratings are far away from the majority's opinion. There are many popular defense schemes using the same philosophy. Thus, the attacks in $A^{II}_{set}$ are strong when such defense schemes are used. Although $A^{II}_{set}$ may not represent strong attacks against every defense scheme, due to the diversity of our data set, we believe that the testing results using $A^{II}_{set}$ can reveal many insights about TAUCA and the comparison schemes.

B. Experiment Results and Discussion

In this subsection, we first test the building blocks of TAUCA: the detection of objects under attack (Section V.B.1), the direction of attack (Section V.B.2), and the suspicious intervals (Section V.B.3). The overall performance of TAUCA will be presented in Section V.B.4. Two existing and representative schemes will be compared with TAUCA in Section V.C. Testing data set $A^{I}_{set}$ will be used in Section V.B, and $A^{II}_{set}$ will be used in Section V.C.

B.1 Identification of objects under attack

In the first experiment, we evaluate the performance of TAUCA in terms of determining the objects under attack (i.e. suspicious objects). In our system, there is one object (i.e. O1) under attack and 299 objects not under attack.

Assume that TAUCA is tested against N attack profiles. Let n1 denote the number of profiles for which TAUCA accurately detects O1 as suspicious. Then, the detection rate is defined as DR = n1/N. Let m1 denote the total number of objects that are not under attack but are detected as suspicious. Then, the false alarm rate is calculated as FA = m1/(299 · N).
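As a hypothetical worked example (the numbers here are ours, not measured results): if TAUCA is tested against N = 1,000 attack profiles, detects O1 as suspicious in n1 = 950 of them, and flags a total of m1 = 2,990 objects that are not under attack, then DR = 950/1000 = 0.95 and FA = 2990/(299 · 1000) = 0.01.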

Figure 5 shows the Receiver Operating Characteristic (ROC) curves for detecting suspicious objects. We observe that TAUCA can accurately detect suspicious objects. For example, when there are 30 malicious users, the detection rate is always above 95%. Even when there are only 5 malicious users, the detection rate is always above 85%.

Fig. 6: Performance of suspicious interval detection in TAUCA (Li means MO level i; SRIR vs. RR for Na = 10 and Na = 20)

B.2 Attack direction determination

Once the suspicious objects are detected, TAUCA further determines the attack direction: downgrading or boosting. In the second experiment, we set the threshold to 2. It is observed that the attack direction is accurately determined in all of the 96,335 cases in which O1 is detected as a suspicious object.

B.3 Suspicious interval detection

Instead of detecting individual dishonest ratings, the CUSUM detector determines the suspicious intervals in which dishonest ratings are present. We denote the detected suspicious intervals as DSI, and the entire rating sequence as ERS. To measure the performance of the CUSUM detector, we define the reduction rate (RR) and the suspicious rating inclusion rate (SRIR) as follows.

RR = 1 - \frac{\text{number of ratings in DSI}}{\text{number of ratings in ERS}}   (13)

SRIR = \frac{\text{number of dishonest ratings in DSI}}{\text{number of dishonest ratings in ERS}}   (14)

Obviously, we prefer a higher SRIR for accurate detection of malicious users at later steps in TAUCA. We also prefer a higher RR, which means the suspicious intervals are "tight" and do not unnecessarily include too many honest ratings. The tradeoff between SRIR and RR is shown in Figure 6 for different Na and MO levels. Due to the space limitation, we only show the results for Na = 10 and Na = 20.

When Na = 10 and the MO level is 5, to maintain 99.66% SRIR, the proposed CUSUM detector can achieve RR > 81.6%. That is, the suspicious intervals, which contain only 18.4% of the total ratings, include 99.66% of the dishonest ratings. When Na = 20 and the MO level is 5, to maintain 99.31% SRIR, the proposed CUSUM detector can achieve RR > 71%. The proposed change detector effectively "zooms in" on the dishonest ratings.

B.4 Overall Performance of TAUCA

We first study the detection rate and false alarm rate of TAUCA in terms of detecting malicious users. Figure 7 shows the detection rate versus false alarm rate (i.e. ROC curves) when the number of malicious users (Na) is 20. Obviously, the performance is better for higher MO levels. When the MO level is 1 or 2, the detection rate is low because the dishonest ratings in these cases are very close to the honest ratings and do not have much impact on the final reputation scores. When the MO level is higher than 2, TAUCA achieves a good detection rate. When the false alarm rate is set to 0.07, the detection rates for different Na and MO levels are shown in Figure 8.

Fig. 7: Performance of malicious user detection in TAUCA (ROC curves, Na = 20, MO levels L1-L5)

Fig. 8: Detection rate for different attack data sets in TAUCA (false alarm rate = 0.07; detection rate vs. number of malicious users, MO levels L1-L5)

The detection rate of malicious users cannot fully describe the performance of TAUCA. Obviously, the amount of damage caused by different malicious users can be very different. We care more about whether the undetected malicious users can cause large damage to the final reputation scores. Therefore, we define the Recovered Reputation Offset (RRO) as Z − Y, where Y is the true reputation of O1 and Z is the average of the remaining ratings after TAUCA has removed all ratings from detected malicious users. Figure 9 shows the RRO for different Na and MO levels.

We make the following observations. First, when the number of malicious users is fixed, attacks with high MO levels are easier to detect and cause less damage to the final reputation score, whereas the attacks with low MO levels cause the largest damage. Second, the largest damage, measured by the RRO, is below 0.13. This largest damage is achieved when the test data set is $A^{I}_{set,30,2}$. Without TAUCA, the attacks in $A^{I}_{set,30,5}$ cause the largest damage to the reputation of O1, misleading the final reputation score from 3.98 to 3.39. This means that TAUCA reduces the bias in the final reputation score by a factor of (3.98 − 3.39)/0.13 ≈ 4.5. In summary, TAUCA is a very effective method to protect feedback-based reputation systems.

Furthermore, the results in this section do not represent the best performance of TAUCA in terms of RRO. As shown in the next subsection, when the threshold is chosen to minimize the RRO, TAUCA can reduce the RRO to 0.03.

Fig. 9: Recovered reputation offset in TAUCA (RRO vs. number of malicious users, MO levels L1-L5)

C. Performance Comparison

In this section, TAUCA is compared with the Iterative Refinement (IR) model [8] and the Beta function model [7], two existing and representative defense schemes for reputation systems. As mentioned earlier, the data set $A^{II}_{set}$ is used in this part of the experiments.

C.1 Iterative Refinement model vs. TAUCA

In the IR algorithm, a weight factor is assigned to each user. The final reputation score is calculated as the weighted average of the users' rating values. The IR algorithm has a key parameter, called β. The details of the IR algorithm can be found in [8].
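For context, here is a sketch of one plausible reading of the IR iteration summarized in Section II: object quality is the weighted average of ratings, and a user's weight grows with his/her judging power, modeled here as the inverse rating variance around the current estimates raised to the power β. This is our paraphrase of [8], not the authors' code, and the exact role of β is an assumption.

import numpy as np

def iterative_refinement(R, mask, beta=0.8, iters=50, eps=1e-6):
    # R: users x objects matrix of rating values; mask: 1 where a rating exists, else 0.
    w = np.ones(R.shape[0])  # initial uniform user weights
    for _ in range(iters):
        # Object quality: weighted average of the ratings each object received.
        q = (w[:, None] * R * mask).sum(axis=0) / ((w[:, None] * mask).sum(axis=0) + eps)
        # Per-user variance around the current quality estimates.
        V = (((R - q[None, :]) ** 2) * mask).sum(axis=1) / (mask.sum(axis=1) + eps)
        w = 1.0 / (V + eps) ** beta  # assumed parameterization of the weight update
    return q, w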

Figure 10 shows the RRO value when IR is used, for different Na and β values. It is seen that IR can yield very low RRO when β is properly chosen. For example, when β = 0.8, the RRO is 0.02, which is slightly lower than the best RRO of TAUCA (see the blue line in Figure 12). This is the advantage of IR.

Fig. 10: Performance of the IR algorithm for different Na and β

However, IR has a major disadvantage. A large number of users, both honest and dishonest, have very small weight factors. As a consequence, the calculation of the final reputation score in IR only counts the ratings from a few users. We call a user a low-weighted user if its weight is lower than 1% of the highest weight. Table I shows the number of low-weighted users and the number of other users whose opinions are actually counted.

Na    Num-Reg-Weight    Num-Low-Weight
5     3.1               118.8
10    2.7               124.3
15    4.7               126.5
20    4.3               132.6
25    3.8               138.1
30    4.4               142.6

TABLE I: Performance of IR for malicious user detection (Num-Low-Weight: the number of low-weighted users; Num-Reg-Weight: the number of other users whose opinions really count.)

With a large number of low-weighted users, the IR algorithm cannot detect malicious users. In addition, it may raise concerns, since it only counts the opinions of very few users.

C.2 Beta function model vs. TAUCA

Figure 11 shows the RRO values when the beta-function based defense method in [7] is used, where q is the key parameter of this method. Based on the results in Figure 11, the best q value is chosen as 0.2, which yields low RRO for different Na values.

Fig. 11: Performance of the Beta model for different Na and q

Fig. 12: Recovered reputation offset comparison between TAUCA and the Beta model

In Figure 12, we compare the RRO of the Beta model and that of TAUCA. In this set of experiments, the threshold in TAUCA is chosen such that the RRO is minimized. The results in Figure 12 for TAUCA are therefore different from those in the previous subsection. Note that there is a tradeoff between RRO and the performance of malicious user detection: a smaller RRO usually means a lower detection rate and a lower false alarm rate. In Figure 12, we observe that TAUCA has more stable performance as the number of malicious users changes.

Furthermore, we show the detection rate and false alarm rate for malicious user detection in Table II. It is clearly seen that TAUCA is much more effective in terms of detecting malicious users. Whereas the Beta model achieves a 0.37∼0.72 detection rate, the detection rate of TAUCA is between 0.86 and 0.98. The false alarm rate is greatly reduced as well. For example, when Na = 20, TAUCA's false alarm rate is 53.23% lower than that of the Beta model.

Na          5      10     15     20     25     30
Beta DR     0.37   0.61   0.55   0.67   0.72   0.54
TAUCA DR    0.86   0.95   0.91   0.92   0.97   0.99
Beta FA     0.57   0.57   0.57   0.57   0.57   0.57
TAUCA FA    0.02   0.03   0.03   0.04   0.04   0.04

TABLE II: Malicious user detection (DR: detection rate, FA: false alarm rate)

VI. Conclusion

In this paper, a comprehensive anomaly detection scheme, TAUCA, was designed and evaluated for protecting feedback-based online reputation systems. To analyze the time-domain information, a revised-CUSUM detector was developed to detect suspicious intervals. To exclude the honest ratings that fall inside the suspicious intervals, similarity calculation and clustering techniques were used to identify the colluding malicious users. Real user attack data were used in the performance evaluation. Compared with the IR and Beta model methods, TAUCA achieved a similar recovered reputation offset value, but a much higher detection rate in malicious user detection. Whereas IR failed to detect malicious users and the Beta model achieved a 0.37∼0.72 detection rate, the detection rate of TAUCA is 0.86∼0.98.

REFERENCES

[1] P. Resnick, K. Kuwabara, R. Zeckhauser, and E. Friedman, "Reputation systems," Communications of the ACM, vol. 43, no. 12, pp. 45–48, 2000.
[2] C. Dellarocas, "The digitization of word-of-mouth: Promise and challenges of online reputation systems," Management Science, vol. 49, no. 10, pp. 1407–1424, Oct 2003.
[3] J. A. Livingston, "How valuable is a good reputation? A sample selection model of internet auctions," The Review of Economics and Statistics, vol. 87, no. 3, pp. 453–465, Aug 2005.
[4] P. Resnick, R. Zeckhauser, J. Swanson, and K. Lockwood, "The value of reputation on eBay: A controlled experiment," Experimental Economics, vol. 9, no. 2, pp. 79–101, Jun 2006.
[5] Press release: Online consumer-generated reviews have significant impact on offline purchase behavior, Nov 2007. [Online]. Available: http://www.comscore.com/press/release.asp?press=1928
[6] M. Hines, Scammers gaming YouTube ratings for profit. [Online]. Available: http://www.infoworld.com/d/security-central/scammers-gaming-youtube-ratings-profit-139
[7] A. Jøsang and R. Ismail, "The beta reputation system," in Proceedings of the 15th Bled Electronic Commerce Conference, 2002.
[8] P. Laureti, L. Moret, Y.-C. Zhang, and Y.-K. Yu, "Information filtering via iterative refinement," Europhysics Letters, pp. 1006–1012, 2006.
[9] A. Whitby, A. Jøsang, and J. Indulska, "Filtering out unfair ratings in bayesian reputation systems," in Proceedings of the 7th Int. Workshop on Trust in Agent Societies, 2004.
[10] J. Weng, C. Miao, and A. Goh, "An entropy-based approach to protecting rating systems from unfair testimonies," IEICE Transactions on Information and Systems, vol. E89-D, no. 9, pp. 2502–2511, Sep 2006.
[11] P. Michiardi and R. Molva, "Core: A collaborative reputation mechanism to enforce node cooperation in mobile ad hoc networks," in Proceedings of the IFIP TC6/TC11 Sixth Joint Working Conference on Communications and Multimedia Security, pp. 107–121, 2002.
[12] H. Yu, M. Kaminsky, P. B. Gibbons, and A. Flaxman, "SybilGuard: Defending against sybil attacks via social networks," in Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 267–278, 2006.
[13] M. Abadi, M. Burrows, B. Lampson, and G. Plotkin, "A calculus for access control in distributed systems," ACM Transactions on Programming Languages and Systems, vol. 15, no. 4, pp. 706–734, 1993.
[14] C. Dellarocas, "Immunizing online reputation reporting systems against unfair ratings and discriminatory behavior," in Proceedings of the 2nd ACM Conference on Electronic Commerce, 2000.
[15] M. Chen and J. Singh, "Computing and using reputations for internet ratings," in Proceedings of the 3rd ACM Conference on Electronic Commerce, 2001.
[16] J. Zhang and R. Cohen, "Trusting advice from other buyers in e-marketplaces: The problem of unfair ratings," in Proceedings of the 8th International Conference on Electronic Commerce, 2006.
[17] Y. Yang, Y. Sun, S. Kay, and Q. Yang, "Defending online reputation systems against collaborative unfair raters through signal modeling and trust," in Proceedings of the 24th ACM Symposium on Applied Computing, Mar 2009.
[18] Y. Yang, Q. Feng, Y. Sun, and Y. Dai, "Reputation trap: A powerful attack on reputation system of file sharing p2p environment," in Proceedings of the 4th International Conference on Security and Privacy in Communication Networks, Sep 2008.
[19] Y. Liu, Y. Yang, and Y. L. Sun, "Detection of collusion behaviors in online reputation systems," in Proceedings of the Asilomar Conference on Signals, Systems, and Computers, Oct 2008.
[20] E. S. Page, "Continuous inspection schemes," Biometrika, vol. 41, no. 1/2, pp. 100–115, Jun 1954.
[21] T. K. Philips, Monitoring Active Portfolios: The CUSUM Approach.
[22] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.
[23] [Online]. Available: http://www.ele.uri.edu/nest/cant.html
[24] [Online]. Available: www.douban.com