
Master Project Report

Identifying Fabricated Online Transactions with

Collective Outlier Detection based on Mixed Attributes

Author: Cunqing Liu

Advisor: Dr. Jing Deng

Abstract

E-commerce has forever changed everyone's shopping habits with its convenience. While fraud in e-commerce often occurs in the form of credit card fraud, social spam, and fraud in electronic business, in this work we focus on transaction fabrication. These seemingly legitimate transactions are faked to show off a product's popularity, implicitly advertising the quality of the product and elevating the seller's reputation rate. We explore a density-based collective outlier detection approach to detect such fabricated online transactions. We describe our experiments on a real dataset and show that the method can take advantage of mixed attributes in mining special relationships between fabricated transactions, and thereby distinguish them from normal transactions. When adopting a supervised method to process the training set, we also demonstrate an approach to quickly decide multiple model parameters, and show that the model performs well on both the training set and the test set.

1. Introduction

Online shopping is very popular nowadays. People do not have to go to a shopping mall to buy shoes and clothes; instead, sitting in front of a laptop and looking through the description, selling records, and customers' comments on the webpage of a commodity may lead to a purchase. Most of these purchases are shown as transactions online in one way or another, so it is interesting for others to see these records before a new transaction is made. However, some of them are fabricated. Most such fabricated transactions are intended to increase sales by misleading customers into believing how "hot" the products are. It is therefore important to identify such fraudulent transactions.

Generally, one motivation for an individual seller to make fake transactions is that the commodities are new and need good statistics to gain a strong ranking, so that they are more likely to appear on customers' search result pages. This is due to e-commerce companies' internal information management systems and their mechanisms for displaying large numbers of search hits. Another motivation is to make some seed transactions. Here, we call a transaction a seed transaction when the probability of purchases by other customers increases after they look through the seed transaction records. Notice that not every e-commerce website shows the transaction history. A seller may have someone buy a few commodities online and leave positive feedback. Because e-commerce companies have cheat detection of their own, many fake transactions come with a real package or box shipped but no real commodity inside. In addition, each buyer will purchase several products in different styles, such as red or yellow, to make the cheating less obvious than buying the same product and style every time.

In this work, we design a method of detecting fabricated online transactions on a C2C online transaction dataset crawled from Taobao, the most popular C2C online site in China. The method we adopt is density-based collective outlier detection in mixed-attribute data sets. We take advantage of the correlation between fabricated transactions, which is fairly different from that between normal transactions. Each transaction has features (attributes) such as transaction date, product, product information, product quantity, seller, and buyer. We utilized mixed attributes as the input of our model, tested our approach on experimental data, and reached a good F1 score.

2. Related work

Research and development work on online social networks has been carried out. Some works analyze systemic problems and frameworks, such as [1][5]. Song et al. in [1] proposed an adaptive deception detection framework that consists of steps and techniques for automated detection of deceptive behavior in social media. The framework focuses on better utilizing existing techniques; for example, it has a defined step to automatically select a method such as a statistical modeling method, a clustering-based method, or a classification-based method.

Post et al. [5] presented a system to address online marketplace reputation

manipulation conducted by malicious users. In their work, the proposed system

calculates user reputations using a max-flow-based technique over the network

formed from prior successful transactions. They evaluated it with data collected from eBay and demonstrated that the system is able to bound the amount of fraud while rarely impacting non-malicious users.

Zombie user account detection has also been investigated in many works [13][14]. In [13], Deng et al. presented the ZLOC scheme to detect zombie users. The scheme provides an efficient approach by taking advantage of location information, without relying on detailed information about suspicious users. Wu et al. used an effective probabilistic model in [14] that takes local user features and neighbors' information as input to detect marionette microblog users.

There are also studies on spam detection [16][17][18][19][20]. User behavior and post contents are often used as features for classifiers in spammer detection on communities such as Twitter.

Lee et al. in [16] used a honeypot approach to collect information from spammers as input for the classifier. The framework in their research supports ongoing and active automatic detection of new and emerging spammers. They used honeypots to gather information from both MySpace and Twitter communities for experiments and found evidence that social honeypots can attract useful information about spam behaviors that distinguishes spammers from legitimate users.

Another work [17] demonstrated detection of spammers on Twitter based on a large amount of collected data. Benevenuto et al. in [17] built the training dataset by manually classifying users based on three famous trending topics in 2009. They then selected a number of characteristics related to tweet content and user social behavior as features to classify spammers from normal users. They studied the content attributes and user behavior attributes respectively, and then applied a support vector machine (SVM) classifier to successfully classify approximately 70% of spammers and 96% of non-spammers. A similar spam detection study on Twitter [18] also used user-based and content-based features for classification. In that work, McCord et al. also defined statistics based on the percentage distribution of tweets in each 3-hour period within a day, with the conjecture that spammers tend to be most active during the early morning hours while regular users are more likely sleeping. Their content features include the number of URLs, replies or mentions, keywords, retweets, average tweet length, and so on.

Some other works performed spam detection on Twitter with different methods. In [19], Wang used a directed social graph model to explore the follower and friend relationships on Twitter. Besides content-based features similar to those in the above works, graph-based features were also used in [19].

In [20], Jin et al. introduced a GAD clustering algorithm for scalable, online social media spam detection. The features they used include image content features, text features, and social media network features. Since their text features are extracted from image-associated content such as captions, descriptions, comments, and URLs, the first two types of features are both related to photos. Their third type, social network features, consists of individual characteristics of user profiles and user behaviors. The image content and text features are related to photos, differing from the above works; this could help the classification of spammers in social media as photos appear more and more as part of content.

There have also been works on electronic commerce applications. Dong et al. in [6] discussed in-auction fraud in comparison to pre-auction and post-auction frauds. The work listed characteristics of auctions that may involve fraud; for example, auctions with shills have more bids on average than those without, and the average minimum starting bid in an auction with shills is lower. Shill bidding patterns are also analyzed into several types: acting alone, seller collusion, and accomplice. Acting alone means a seller carries out shill bids by himself or herself. Seller collusion means several sellers help each other place bids on each other's transactions for mutual benefit, while accomplice refers to the case where a seller hires or invites family members and friends to serve as shills who place bids on the seller's item but are instructed to avoid winning. While these patterns share some common characteristics, such as having no true buyers, with the fabricated online transactions in our scenarios, they differ in many aspects; for instance, the auction frauds they described tend to increase the final deal price.

Ku et al. in [3] used social network analysis (SNA) and decision tree classification to detect online auction fraud. In their work, the dataset came from a random selection of "blacklisted IDs" and legitimate IDs in Yahoo auctions, and related features include positive reputation rate, frequency of positive or negative reputations, detailed merchandise information, 2-core or 3-core IDs from SNA, etc. Thus both user-related and graph-related features are used in their classifier, which is implemented with the decision tree method.

Xiong et al. in [2] performed fraud transaction classification on a Taobao dataset with basic classification methods: Naïve Bayes, decision tree, and AdaBoost. When dealing with the imbalanced dataset, they used under-sampling techniques to reduce the majority class size and make the dataset suitable for their classifiers. In their implementation, majority voting is used across classification models that take advantage of different techniques. While adopting majority voting among multiple classifiers is a good approach, the difficulty lies in that an under-represented majority may change the distribution characteristics, and the related minority objects are usually very limited.

Some works focus on early fraud detection, such as [4]. Chang et al. in [4] proposed an early fraud detection method that classifies normal traders and fraudsters. They selected a subset of attributes from a large candidate attribute pool and utilized a complement phased modeling procedure to extract features from the latest part of traders' transaction histories, such that the time and resources needed for modeling and data collection can be reduced.

In [7], Chau et al. investigated fraudulent personality detection based on user-level features that represent user account statistical activities and network-level features that represent interactions with others. Their user-level features include user-related information such as the number of transactions and the average price of goods exchanged. Transactions are modeled as a graph, with a node for each user and an edge for all the transactions between two users. They also presented several models to handle both the user-level features and the transaction data model. The representation of transaction history as a graph was also used by Chau et al. in [8], where exposed fraudsters' transaction histories were used to select features for a decision tree classifier in online auction fraud detection.

Other directions of fraud detection include detecting fake customer reviews [9][11], processing missing information with a multinomial logit model [10], and using location to help fraud detection in e-commerce transactions, such as [12], which studied the feasibility of using location in e-commerce transactions to help fraud detection.

In summary, the usual way is to apply a classifier on the features or to combine it with social network analysis. However, such approaches may be subject to limited feature information, the imbalanced nature of the dataset, and mixed attributes. In this work, we modeled the problem as an outlier detection issue and explored an approach to mining, in an imbalanced dataset with mixed attributes, the special relationship between fabricated transactions by transferring from feature space to density space. With proximity-based techniques and a statistical tool, our approach demonstrates some interesting, favorable results.

3. A density-based collective outlier detection approach

3.1 Main processing design

3.1.1 Background

Before explaining our model, we review several techniques: classification, clustering, and outlier detection. We then introduce the workflow on our experiment data.

Classification is known as supervised learning because the supervision in the learning comes from the labeled examples in the training dataset. Related techniques include decision trees, Bayesian methods, support vector machines (SVM), neural networks, and so on. In contrast, clustering is known as unsupervised learning because the learning process is unsupervised: the input examples are not labeled into different classes. Its analysis methods include partitioning methods, hierarchical methods, density-based methods, probabilistic model-based methods, and so on. Both are popular research areas in data mining and machine learning.

Outlier detection, or anomaly detection, is the process of finding a small number of data objects with behaviors vastly different from the majority of the data. Such data objects are called outliers or anomalies. It is a different topic from classification and clustering, though it may reuse some techniques applicable to them. The goal of outlier detection is usually to identify unique cases that deviate substantially from the majority patterns; clustering, on the other hand, finds the majority patterns in the dataset. In practice, outlier detection is important in many applications such as fraud detection, medical care, public safety and security, industry damage detection, image processing, sensor/video network surveillance, and intrusion detection.

Now let us look at basic methods for outlier detection. From the viewpoint of data labeling, there are supervised methods and unsupervised methods, which deal with datasets with and without class labels, respectively. For the former, outlier detection on a labeled dataset can in general be modeled as a classification problem. However, the basic classification methods mentioned above will not work well, because a dataset for outlier detection is highly imbalanced: the number of normal samples highly outnumbers the number of outlier samples, which tends to bias the classifier. For the latter, i.e., when the dataset is not labeled, clustering methods may be used, and the problem becomes searching for outliers that do not belong to any cluster. However, this generally depends on how successfully the clusters are identified first, which is a costly task; clustering algorithms are optimized to find clusters rather than outliers. In addition, unsupervised methods cannot effectively detect collective outliers, since the objects in a small outlier group share similarity and form a cluster too.

From the viewpoint of the assumptions made during the detection process, outlier detection methods fall into three types: statistical methods, proximity-based methods, and clustering-based methods. Statistical methods assume that the normal objects in a data set are generated by a stochastic model; consequently, normal objects occur in regions of high probability for the stochastic model, and objects in regions of low probability are outliers. Proximity-based methods use distance measures to quantify the similarity between objects, so objects that are far from others can be regarded as outliers. They assume that the proximity of an outlier object to its nearest neighbors significantly deviates from the proximity of most other objects to theirs. There are two types of proximity-based outlier detection methods, distance-based and density-based, which use a radius and a k-distance neighborhood, respectively. Clustering-based approaches detect outliers by examining the relationship between objects and clusters.

Now let us look at the types of outliers. In general there are three categories: global outliers, contextual (or conditional) outliers, and collective outliers. A global outlier deviates significantly from the rest of the data set. Global outliers are sometimes called point anomalies and are the simplest type; most outlier detection methods are aimed at finding them. In a given dataset, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object. A subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set. For example, a stock transaction between two parties is normal; however, a large set of transactions of the same stock among a small party in a short period is a collective outlier, because it may be evidence of market manipulation.

3.1.2 Problem analysis

Finding fabricated transactions can be regarded as identifying outliers or anomalies. As noted, there are three types of outliers in data mining: global, contextual, and collective. Based on our analysis of the scenarios, globally identifying a single fake transaction is not a good choice. If a group of objects together forms an abnormal pattern, as in Fig. 1, that is, the group significantly differentiates itself from the other objects with an unusual character, then this group of objects as a whole is a collective outlier. In our case, the transactions conducted for a seller together form a set of fabricated objects and hence a collective outlier. We can effectively detect the cheating behavior neither on an individual product nor on an individual transaction: the sellers of fabricated transactions also do normal business and maintain multiple products, and a shill may also really shop for himself or herself at other times.

For products in different genres, the probability of fraud also changes from one to another. Our collection of transactions mainly relates to one genre of product and covers a period of eight months. Attributes include transaction date, product, product information, product quantity, seller, buyer, etc. If we consider that fabricated transactions do not come from the real demands of random customers, there could be high correlation among them. If we draw the transactions as objects on a map, the fabricated ones probably congregate in a small group and thus exhibit high local abnormality (see Fig. 1).

Fig. 1. A collective outlier example

When we structure the problem this way, to identify the abnormal group we first have to discover the relationships between the multiple objects inside the group. This is non-trivial work, and exploring the structure of the relationship usually has to be part of the outlier detection process [22]. If we use unsupervised methods, that is, clustering-based methods, it becomes an expensive process. If we use supervised methods, which means we need to label the dataset, the problem transforms into a classification issue. However, recall that the nature of datasets in outlier detection makes it hard to directly apply common classification methods to collective outlier detection. Oversampling techniques may be used to handle the imbalanced dataset, but they directly change the characteristics of the outliers, especially the structure of relationships within collective outliers. Also, our dataset has mixed attribute types, including numeric values and a large number of discrete or symbolic values such as seller, buyer, and product, which makes it hard to apply common classification methods. In addition, there is no explicit temporal or spatial structure in our problem, so the stochastic approaches often used for collective outliers are not expected to work well.

Based on the above analysis, we adopt a supervised method to discover, step by step, the structure of collective outliers in our dataset. First, we label the training dataset with domain knowledge. Second, we use a mixed-attribute distance measure model to represent the relation between objects in the training dataset, leaving the model parameters to be decided according to what we can obtain while processing the dataset. We use a density-based technique and logistic regression as the rest of the classifier to make use of the distance measure model. We use the local reachability density to test candidate distance model parameters on our training dataset, and the logistic regression further processes the results of the local reachability density computation. The logistic regression model classifies objects in the density space rather than in the raw feature space.

The local reachability density, however, involves an additional setting for the neighborhood. Therefore we select a set of candidate values for the neighborhood and, together with the distance model settings, investigate their ability to separate collective outliers from normal objects, selecting an empirical setting for all model parameters. Thereafter we can directly convert object feature values into object density values and classify them with the logistic regression model.

After we have made the logistic regression model work with the best options, we validate the neighborhood settings and modify them if the current value does not yield the best final performance. Finally, we evaluate the method on our test dataset and objectively assess our experiments. The main workflow is shown in Fig. 2.

[Figure 2 depicts the data-processing workflow: data preprocessing; labeling the datasets (supervised method) and splitting into training and test sets; building the distance measurement model on the feature space; applying the density-based method with a neighborhood setting to map the feature space to a density space that exposes the relationship structure; parameter investigation with training evaluation; and finally the logistic regression classifier.]

Fig. 2. Workflow of data processing

3.2 Model building

3.2.1 Distance measure model

To measure the distance or similarity between data objects, we first investigate the datasets containing the transaction information. This requires background knowledge of the specific application. From a list of features, we selected some attributes that are computable in our model to populate our dataset, as shown in Table 1.

Table 1. Attributes of the Transaction

    Attribute   Explanation                            Examples
    id          Transaction unique id                  0, 1, 2, ...
    seller      The offerer of this transaction        Ivymagg**, maggie505**
    location    Location of seller                     Beijing, Shanghai
    regTime     Seller's registration time             2006-05-31
    product     Product unique id                      10255280939
    price       Price of product                       1550, 997.5
    pstyle      Style of product                       Black color, default style
    quantity    Quantity of products in transaction    1, 2, 3, ...
    buyer       Buyer in this transaction              d**u, g**n
    date        Date the transaction occurred          2011-12-03

Instead of using Euclidean distance or standardized Euclidean distance, we use a weighted distance to measure the mixed attribute types, and define the distance measure function as follows.

Let T1 and T2 be two transactions and let d(T1, T2) be the distance between them. Any Ti is intuitively a row in the data set table, formally written as

T_i = (sr, loc, rgt, p, st, pr, q, br, dt)

where sr, loc, rgt, p, st, pr, q, br, dt denote the attributes seller, location, regTime, product, pstyle, price, quantity, buyer, and date, respectively.

Now let A be the set of all attributes and a a single attribute:

A = { sr, loc, rgt, p, st, pr, q, br, dt }, a ∈ A,

and let T_i(a) be the value of attribute a of T_i. We have the first definition of d(T_i, T_j):

d(T_i, T_j) = \sum_{a \in A} w_n \cdot D\big(T_i(a), T_j(a)\big) + w_0

where n is the index of a in A, and D(x, y) is defined as

D(x, y) = B(x, y) = \begin{cases} 1, & x \ne y \\ 0, & x = y \end{cases} \quad \text{for } a \in A_1 = \{ sr, loc, p, st, br \}

D(x, y) = \left| (x - y) / s_a \right| \quad \text{for } a \in A_2 = \{ dt, pr, rgt, q \}

where s_a is the range of values of attribute a and serves to normalize the term. Substituting D from above gives the complete form of d(T_i, T_j):

d(T_i, T_j) = \sum_{a \in A_1} w_n \cdot B\big(T_i(a), T_j(a)\big) + \sum_{a \in A_2} w_n \cdot \left| \big(T_i(a) - T_j(a)\big) / s_a \right| + w_0

This definition combines the numerical, discrete, and nominal attributes in a simple and computable way. The attributes are of mixed types and represent different information in different concepts; to address this, their distances are weighted with parameters. We did not use Euclidean distance because it cannot handle nominal values appropriately [28].
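As an illustration, the weighted mixed-attribute distance can be sketched in Python. The attribute names mirror Table 1, while the weights, ranges, and sample values below are hypothetical placeholders, not the settings used in our experiments:

```python
# Sketch of the mixed-attribute weighted distance d(Ti, Tj).
# Numeric attributes (date, regTime) are assumed pre-converted to numbers.

NOMINAL = ["seller", "location", "product", "pstyle", "buyer"]   # A1: B(x, y) term
NUMERIC = ["date", "price", "regTime", "quantity"]               # A2: |x - y| / s_a term

def mixed_distance(t1, t2, weights, ranges, w0=0.0):
    """Weighted distance over mixed attributes.

    t1, t2  : dicts mapping attribute name -> value
    weights : dict mapping attribute name -> weight w_n
    ranges  : dict mapping numeric attribute -> value range s_a (normalization)
    """
    d = w0
    for a in NOMINAL:                 # B(x, y): 1 if values differ, 0 if equal
        d += weights[a] * (0.0 if t1[a] == t2[a] else 1.0)
    for a in NUMERIC:                 # normalized absolute difference
        d += weights[a] * abs(t1[a] - t2[a]) / ranges[a]
    return d
```

For example, two transactions that differ only in buyer and by half the price range are at distance 1.5 under unit weights and w0 = 0.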

3.2.2 Density-based method

The metric of local relative density we use is the local reachability density. We define the local reachability density (denoted ρ) of a data object, a transaction here, as

\rho_k(o) = \frac{\| N_k(o) \|}{\sum_{o' \in N_k(o)} \delta_k(o' \leftarrow o)}

where δ_k(o ← o′) is the reachability distance from o′ to o:

\delta_k(o \leftarrow o') = \max\{\, d_k(o),\; d(o, o') \,\}

The k-distance of o, denoted d_k(o), is the distance between o and its k-th nearest neighbor (kNN). Thus k controls the size of the neighborhood, but not necessarily its actual size, because several neighbors may lie at the same distance. For example, if k is 3 and object x has 4 more neighbors at the same distance as its 3rd nearest neighbor, then object x has 7 neighbors in total. Given a set of objects D, the k-distance neighborhood of object o, N_k(o), is defined as

N_k(o) = \{\, o' \mid o' \in D,\; d(o, o') \le d_k(o) \,\}
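A small sketch of these definitions (k-distance, the tie-inclusive neighborhood N_k(o), and ρ_k) might look as follows. The brute-force pairwise search is purely illustrative, not the implementation used in our experiments:

```python
def k_distance(o, data, dist, k):
    """d_k(o): distance from o to its k-th nearest neighbor."""
    return sorted(dist(o, p) for p in data if p is not o)[k - 1]

def k_neighborhood(o, data, dist, k):
    """N_k(o): all objects within d_k(o); can exceed k objects when ties occur."""
    dk = k_distance(o, data, dist, k)
    return [p for p in data if p is not o and dist(o, p) <= dk]

def local_reachability_density(o, data, dist, k):
    """rho_k(o): |N_k(o)| divided by the summed reachability distances
    delta_k(o' <- o) = max{d_k(o'), d(o, o')} over the neighborhood."""
    nbrs = k_neighborhood(o, data, dist, k)
    total = sum(max(k_distance(p, data, dist, k), dist(o, p)) for p in nbrs)
    return len(nbrs) / total
```

Note that `k_neighborhood` returns every object at distance at most d_k(o), which reproduces the tie behavior described above.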

To better understand the reachability distance, let us look at its asymmetry. The reachability distance is not symmetric; that is, in general

\delta_k(o' \leftarrow o) \ne \delta_k(o \leftarrow o')

Even though d(o, o′) = d(o′, o), in general d_k(o′) ≠ d_k(o), which means that objects o and o′ have different distances to their k-nearest neighbors. For example, with k = 3 as shown in Fig. 3, d(o, o′) = d(o′, o); however, object o has d_3(o) of length R1 while object o′ has d_3(o′) of length R2. As a result, the reachability from o′ to o is

\delta_3(o \leftarrow o') = \max\{\, d_3(o),\; d(o, o') \,\} = R1

but the reachability from o to o′ is

\delta_3(o' \leftarrow o) = \max\{\, d_3(o'),\; d(o', o) \,\} = d(o', o) < R1

Fig. 3. Asymmetry of reachability distance

We use the local reachability density ρ_k as the metric for collective outliers. In the following, we use ρ as shorthand for ρ_k, and "density" also means ρ_k. The higher the density of a data object (a transaction) and the lower the surrounding local densities, the higher the possibility that this object belongs to a collective outlier. Setting a proper neighborhood is therefore an important step. Here we use ρ rather than the LOF (local outlier factor), which is defined as

LOF_k(o) = \frac{\sum_{o' \in N_k(o)} \rho_k(o') / \rho_k(o)}{\| N_k(o) \|}

The main reason is that LOF is suited to identifying isolated or low-density objects that are too isolated to be included in any group, as are similar factors such as LOF′ and LOF″ [26]. Since we are identifying collective outliers, we are actually distinguishing minor groups with high local density. Moreover, unlike normal classification, we switch from the feature space to the density space, after which the correlation between objects can be measured based on density.

3.2.3 Logistic regression

To separate a minority from the clusters, we set a cut-off on the density; the minority objects (transactions) on one side of the cut-off are considered the expected ones. We employ logistic regression to classify the densities into two classes, the majority and the minority. Let Y_i = 1 if the i-th transaction is fabricated:

\log\left( \frac{P(Y_i = 1)}{1 - P(Y_i = 1)} \right) = \alpha + \beta X_i

where α is the intercept and X_i (the ρ of the i-th object) is the predictor; the probability of a transaction being fabricated is

P(Y_i = 1) = \frac{e^{\alpha + \beta X_i}}{1 + e^{\alpha + \beta X_i}}

We use a default threshold (probability cut-off) of p = 0.5 to decide the class of an object.
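The classification step in density space can be sketched minimally as follows, assuming already-fitted coefficients α and β; the coefficient values in the usage example are hypothetical, not our fitted ones:

```python
import math

def fabricated_probability(rho, alpha, beta):
    """P(Y=1 | X=rho) under the logistic model: log-odds = alpha + beta * rho."""
    z = alpha + beta * rho
    return math.exp(z) / (1.0 + math.exp(z))

def classify(rho, alpha, beta, cutoff=0.5):
    """Label 1 (fabricated) when the predicted probability exceeds the cut-off."""
    return 1 if fabricated_probability(rho, alpha, beta) > cutoff else 0
```

With the default cut-off p = 0.5, an object is labeled fabricated exactly when α + βρ > 0, i.e., for positive β, when its density exceeds −α/β.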

4. Implementation and performance evaluation

4.1 Performance metric

There are performance metrics such as true positives, false positives, true negatives, false negatives, precision, recall, and the F1 score. We use the F1 score as a comprehensive measure of performance:

F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

where the precision, or positive predictive value, is

\mathrm{precision} = \frac{TP}{TP + FP}

and the recall, or true positive rate, is

\mathrm{recall} = \frac{TP}{TP + FN}

Fig. 4. Precision and recall illustration

In our study, a true positive is a correctly detected fabricated transaction; a false positive is a normal transaction incorrectly detected as fabricated; and a false negative is a fabricated transaction that is incorrectly missed. Intuitively, precision is the fraction of truly fabricated transactions among all detected fabricated transactions, and recall is the fraction of fabricated transactions detected among all fabricated transactions, including those not detected. Since the two classes, fabricated and normal transactions, are highly imbalanced, we focus on the F1 score of the relevant class, class 1, which contains the fabricated transactions.
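The metric computation itself is straightforward; a small sketch with illustrative counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the relevant (fabricated) class,
    computed from the detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, 81 true positives, 0 false positives, and 19 false negatives give precision 1.0, recall 0.81, and F1 ≈ 0.895.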

4.2 Processing on training dataset

Our datasets were crawled from Taobao; we selected one genre of product in 2011-2012, with about 20,000 transactions. Our tools collected the public information of online transaction records, each of which includes the buyer, purchase date, product details such as name, price, quantity, and style, the seller, the seller's registration date, location, reputation, and so on.

To set up a training dataset, we randomly selected a time period of ten days and collected the transactions during this period, obtaining 634 transactions in total. Then we manually checked whether these transactions were fabricated and labeled them based on normal purchase scenarios and all the information we have. For example, customers are unlikely to purchase multiple expensive products for the same purpose at one time, and a transaction with negative feedback is unlikely to be fabricated.

To compute the densities, we first have to decide the k value for the k-distance neighborhood. The range of k we selected is [3, 10), according to the number of objects in the experiment dataset.

Secondly, to find a rational setting for the parameters of d(T_i, T_j), which are the weight coefficients w_n, n ∈ [0, 9], we grouped our training dataset into two clusters according to their class labels: 0 for normal transactions and 1 for fabricated transactions. Each cluster has a centroid to represent it, and the distance between the two centroids is used to represent the distance between the clusters. The mean of the densities is used as the centroid, since we will classify the objects in density space according to their density values.
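The cluster-separation statistics used in this parameter search can be sketched as follows; the density values in the usage example are made up for illustration:

```python
import statistics

def separation_stats(normal_densities, fabricated_densities):
    """Distance between the two cluster centroids (mean densities), and the
    summed within-cluster standard deviation, used to compare parameter settings."""
    centroid_gap = abs(statistics.mean(fabricated_densities)
                       - statistics.mean(normal_densities))
    spread = (statistics.stdev(normal_densities)
              + statistics.stdev(fabricated_densities))
    return centroid_gap, spread
```

A larger centroid gap together with a smaller summed spread indicates a setting that separates the two classes better.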

Then, for each candidate neighborhood size k, we independently studied the relation

between the distance of the centroids and each wn, as shown in Figures 5-11, where SR, LOC,

RGT, P, ST, PRC, Q, BR, and DT denote the weights (wn, n ∈ [1, 9]) of the attributes seller,

location, regTime, product, pstyle, price, quantity, buyer, and date, respectively.
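A hedged sketch of such a weighted mixed-attribute distance is given below. It uses a simple equality test for nominal attributes and a normalized absolute difference for numeric ones; the attribute names, the nominal/numeric split, and the per-attribute distances are illustrative assumptions, and the report's exact d(Ti, Tj) may differ:

```python
def mixed_distance(t1, t2, weights):
    """Weighted mixed-attribute distance between two transaction records.
    Nominal attributes compare for equality (0 if equal, else 1); numeric
    attributes use a normalized absolute difference."""
    nominal = ("seller", "location", "product", "pstyle", "buyer")
    numeric = ("price", "quantity", "regTime", "date")
    dist = weights.get("constant", 0.0)
    for a in nominal:
        dist += weights[a] * (0.0 if t1[a] == t2[a] else 1.0)
    for a in numeric:
        denom = max(abs(t1[a]), abs(t2[a]), 1e-12)
        dist += weights[a] * abs(t1[a] - t2[a]) / denom
    return dist

# Example weights shaped like the fifth group in Table 3.
weights = {"constant": 0.05, "seller": 0.20, "location": 0.07,
           "product": 0.18, "pstyle": 0.04, "buyer": 0.60,
           "date": 0.03, "price": 0.05, "regTime": 0.10, "quantity": 0.05}
```

Because the distance is a linear combination of per-attribute terms, each weight wn can be varied independently of the others, which is what makes the per-attribute investigation in Figures 5-11 possible.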

Figs. 5-11. Distance between centroids versus attribute weights, for k = 3, 4, 5, 6, 7, 8, and 9, respectively

Based on the extent to which each 𝑤𝑛 affects the centroid distance in Figures 5-11,

we empirically specified values for the 𝑤𝑛 parameters that reflect the different

importance of each attribute. As shown in Table 3, the first group of 𝑤𝑛 settings

gives the initial values from which we calculated Figure 12.

We further investigated neighborhood sizes from 3 to 9 by examining the

characteristics of the two clusters. For each cluster we computed the centroid

value, and we summarized the within-cluster standard deviation, i.e., the sum of

each cluster's standard deviation. From Figure 12 we can see that selecting a

rational k is a non-trivial step in the relative density method. When k is set to 6

or larger, even though the standard deviation drops, separating the clusters

becomes more difficult since their centroids move much closer together.

Fig. 12. k-distance neighborhood selection

Considering the size of a collective outlier, we tried to avoid too small a

neighborhood while keeping good performance: a too-small neighborhood may not

help the large abnormal neighborhoods stand out, and we therefore selected k = 5.

After deciding the full parameter setting of the model, for each object in

our dataset we computed the density 𝜌, i.e., its local density with respect to a

neighborhood. Finally, the logistic regression model classified the objects according

to their densities, using the default probability threshold of p = 0.5.
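This final classification step can be sketched with scikit-learn, using one-dimensional density features; the synthetic values below merely mimic the separation seen in the experiment and are not the report's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy densities: fabricated transactions (label 1) sit at much higher
# local density than normal ones. All values are illustrative.
rng = np.random.default_rng(0)
densities = np.concatenate([rng.normal(1.0, 0.3, 200),    # normal objects
                            rng.normal(16.0, 1.0, 10)])   # fabricated group
labels = np.array([0] * 200 + [1] * 10)

X = densities.reshape(-1, 1)
clf = LogisticRegression().fit(X, labels)
# Default decision rule: predict class 1 when P(class 1 | density) > 0.5.
pred = (clf.predict_proba(X)[:, 1] > 0.5).astype(int)
```

With a gap this wide between the density groups, the fitted sigmoid places the p = 0.5 boundary cleanly between them.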

On the training dataset, we obtained 100% precision and 81% recall for the relevant

class (class 1), which gives an F1 score of 89%.

Table 2
Performance on training dataset

class  precision  recall  F1-score  support
0      0.99       1.00    1.00      613
1      1.00       0.81    0.89      21

Fig. 13 shows the densities in (density, transaction, date) coordinates, where the

green circles are fabricated transactions that were successfully detected, blue

squares are fabricated transactions that were not detected, and red markers are

normal transactions.

Fig. 13. Densities on training data

We found that in the result space the fabricated transactions can be divided into

three groups: a group with density around 16, much higher than the others; a

group with slightly higher density; and a group (the blue squares) lying inside the

majority of the red dots and hence not detected. A detailed view is shown in Fig. 14.

Fig. 14. Cut-off of densities on training data

For our empirically assigned weight values, we list five groups of settings in

Table 3. The first group gives the initial values based on which we selected k = 5. To

further refine these weights, we tested four more groups and found that the fifth

group achieves the best F1 score; it is therefore selected as the final refined setting for 𝑤𝑛.

Table 3.
Empirical weights of distance function

n    wn values (5 groups)                       attribute
     1      2      3      4      5
0    0.05   0.05   0.05   0.05   0.05           constant
1    0.20   0.25   0.30   0.25   0.20           seller
2    0.10   0.05   0.10   0.08   0.07           location
3    0.12   0.15   0.20   0.20   0.18           product
4    0.10   0.06   0.10   0.06   0.04           pstyle
5    0.40   0.50   0.70   0.65   0.60           buyer
6    0.05   0.04   0.06   0.05   0.03           date
7    0.05   0.04   0.08   0.06   0.05           price
8    0.10   0.15   0.13   0.15   0.10           regTime
9    0.10   0.08   0.06   0.06   0.05           quantity
F1   0.20   0.19   0.38   0.69   0.89

4.2.1 Validating the k selection

Having computed the final performance with the value we selected for the

k-distance neighborhood, we further validate all previously investigated values k ∈

[3, 10). As shown in Figure 15, the F1 score, recall, and precision are all at their

best when k is 5; as k continues to increase, recall is undermined and precision

declines rapidly. The main reason is that when the neighborhood is set larger than

needed, the density difference between normal objects and target objects

decreases, making it hard to correctly separate the two kinds of objects.

This validates that our k-distance neighborhood setting is right for the

training dataset, and we anticipate that this value is effective for the test dataset too.

Fig. 15. Performances with different k-distance neighborhood

4.3 Evaluation on test dataset

After the logistic regression finished, we saved the model and applied it to

predict on a test dataset of 980 transactions, built in the same way as the

training dataset. With 75% recall and 100% precision, the F1 score is 86%.

Table 4, where class 1 denotes fabricated transactions, shows the outcome of

classification from the regression model.

Table 4.
Performance on test dataset

class  precision  recall  F1-score  support
0      0.99       1.00    0.99      931
1      1.00       0.75    0.86      48

Fig. 16. Densities on test data

In Fig. 16, we can still observe a small group with very high

density compared to the other objects. Tracing back to the raw dataset shows that

these objects are identical in almost all features, so that together they produce

very high density values. The other objects in green circles are also detected target

objects; they have partially identical or close feature values that make them

stand out from the objects below. For example, some of them have similar prices

with the same product style and were placed by the same buyer. The red markers and the few

blue squares are the majority of normal objects and the undetected targets, respectively.

In Fig. 17, even though the F1 score is good, it is obvious that the cut-off

of densities is very close to the surface of the majority group of normal objects, which

means the performance will be highly sensitive to a bad cut-off, i.e., one that

tends to increase the recall of target objects. Since recall is the number of

correctly detected targets divided by the total number of targets, an adjusted

cut-off aiming to improve recall would try to include the blue squares, the

missed targets, and therefore many other objects in red near the cut-off would be

incorrectly included. The precision would then decline significantly while recall

improves only a little, so the F1 score would decrease rapidly. In this situation, a

good performance on the test dataset highly depends on how precisely the training finds

a rational cut-off.
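This sensitivity of precision and recall to the cut-off can be illustrated with toy probabilities (the values are invented to mimic the situation described, not taken from the experiment):

```python
def scores_at_threshold(probs, labels, thr):
    """Precision and recall for class 1 at a given probability cut-off."""
    pred = [1 if p >= thr else 0 for p in probs]
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Many normal objects (label 0) sit just below the 0.5 cut-off, and the
# two missed targets (label 1) hide among them.
probs  = [0.1] * 50 + [0.45] * 20 + [0.40, 0.42] + [0.9] * 6
labels = [0] * 50   + [0] * 20    + [1, 1]       + [1] * 6
```

At the default cut-off of 0.5 this yields perfect precision and 75% recall; lowering the cut-off to capture the two missed targets also sweeps in the twenty nearby normal objects, collapsing precision, which is exactly the trade-off described above.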

Fig. 17. Cut-off of densities on test data

Now we check whether the neighborhood setting is still optimal on the test dataset.

Since the performance on the test dataset is given by the classifier learned from the

training dataset, we have to retrain the model each time we evaluate a different

neighborhood size k. As shown in Figure 18, the performance is again best at the

same k value of 5, so this value is optimal.

Fig. 18. Performances with different k-distance neighborhood on test dataset

4.4 Discussion

In this section we discuss several points arising from our results, including

deciding the parameters, handling more attributes, and improving the computation

on nominal attributes.

The first point concerns how we solved the model, especially the neighborhood

selection and the weight specification. Currently there is no effective way to

directly find the best neighborhood size, and our selection method utilizes prior

knowledge represented by our supervised labeling of the dataset. As for specifying

weights, we could investigate each feature independently because our mixed-attribute

distance function linearly combines the distances in each feature space.

Second, our experiment datasets do not contain rich user account information;

however, our model can also handle richer user information, such as delivery location,

most frequently used IP address, and credit card, to better detect fabricated

transactions created through manipulated multiple user accounts. The addition of such

information is expected to increase the performance of our detection method.

Third is the computation of nominal values. When we built the distance

measurement, nominal values were compared only for equality. However, consider

the situation in Figure 19, which shows a part of the products on Amazon.com: the

product 'iPhone' should be closer to 'Samsung Galaxy' than to 'MacBook', even

though iPhone and MacBook both belong to Apple. But if we just compare the

nominal values, none of them are equal to each other. Much semantic information is

thus missed, even though nowadays most information is stored in the databases of

information systems and many attributes are in nominal format.

ANY
├── Books
└── Electronics & Computers
    ├── Electronics
    │   ├── Cell Phones & Accessories ...
    │   │   └── Cell Phones
    │   │       ├── iPhone
    │   │       ├── Samsung Galaxy
    │   │       └── ...
    │   ├── TV & Video
    │   └── ...
    └── Computers
        └── Laptops & Tablets ...
            ├── ChromeBook
            ├── MacBook
            └── ...

Fig. 19. Example of concept hierarchy of Amazon products

There are methods for computing distances over nominal attributes,

such as the Value Difference Metric (VDM) [29], which mainly depends on the

occurrence probabilities of each nominal value in the dataset. In [28], Wilson et al.

presented a heterogeneous distance function that combines VDM and

Euclidean distance to handle nominal and numerical attributes together. Hsu et al. [27]

used a distance hierarchy that involves a concept hierarchy to help compute

mixed-attribute distances. Involving knowledge representation (KR) of background

domain knowledge in mixed-attribute similarity computation could improve

information utilization.
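A minimal sketch of the VDM distance between two nominal values, estimated from value/class co-occurrence counts, is shown below; it follows the general form of VDM as a sum over classes of differences in conditional class probabilities, with the exponent q as a parameter (the toy data is illustrative):

```python
from collections import Counter, defaultdict

def vdm(values, classes, x, y, q=2):
    """Value Difference Metric between nominal values x and y of a single
    attribute: the sum over classes c of |P(c|x) - P(c|y)|**q, with the
    conditional probabilities estimated from co-occurrence counts."""
    count = Counter(values)
    co = defaultdict(Counter)
    for v, c in zip(values, classes):
        co[v][c] += 1
    dist = 0.0
    for c in set(classes):
        px = co[x][c] / count[x]
        py = co[y][c] / count[y]
        dist += abs(px - py) ** q
    return dist

# Values "a" and "b" co-occur with the classes in the same proportions,
# so VDM treats them as close; "c" occurs only with class 0.
values  = ["a", "a", "b", "b", "c", "c"]
classes = [0, 1, 0, 1, 0, 0]
```

Unlike a plain equality test, VDM lets two distinct nominal values be "near" each other whenever they predict the class distribution similarly, recovering some of the semantic information that an equality comparison discards.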

5. Conclusion

We have described a density-based collective outlier detection approach on

mixed-attribute datasets for detecting fabricated online transactions. First we analyzed

the typical scenarios in which fabricated transactions take place in

C2C e-commerce. Based on the characteristics of these frauds and our analysis, we

modeled the problem as collective outlier detection. After preprocessing and labeling

the dataset, we first built a mixed-attribute distance computation model

to measure the similarity of transactions, and then adopted a density-based method to

convert the feature space into a density space.

When searching for a rational setting of all parameters, we chose a range of sizes

for the k-distance neighborhood as the set of candidate neighborhood sizes. After

independently investigating the importance of each attribute at different candidate

neighborhood sizes, we set the weights of the attributes in the distance function. Then,

regarding the transactions as two clusters, normal and fabricated, according to

the labels, we decided the neighborhood size in terms of its ability to separate these

two clusters. On the global densities, we used logistic regression to classify the

transactions. At the end of learning on the training dataset, we validated the

neighborhood size k with F1 scores.

Finally, we objectively evaluated the method on the test dataset. Even though there

are issues, as listed in the discussion, that can be improved in future work, our

experiments produced good results.

REFERENCES

[1] L. Song, W. Zhang, S. S.Y. Liao, and R. C.W. Kwok, "A Critical Analysis Of The

State-Of-The-Art On Automated Detection Of Deceptive Behavior In Social Media." In PACIS,

p. 168. 2012.

[2] H. Xiong, Y. Ren, and P. Jia, "A Novel Classification Approach for C2C E-Commerce Fraud

Detection." International Journal of Digital Content Technology and its Applications 7, no. 1

(2013): 504.

[3] Y. Ku, Yuchi Chen, and C. Chiu, "A proposed data mining approach for internet auction

fraud detection." In Intelligence and Security Informatics, pp. 238-243. Springer Berlin

Heidelberg, 2007.

[4] W. Chang, and J. Chang, "An effective early fraud detection method for online

auctions." Electronic Commerce Research and Applications 11, no. 4 (2012): 346-360.

[5] A. Post, V. Shah, and A. Mislove, "Bazaar: Strengthening user reputations in online

marketplaces." In Proceedings of NSDI’11: 8th USENIX Symposium on Networked Systems

Design and Implementation, p. 183. 2011.

[6] F. Dong, S. M. Shatz, and H. Xu, "Combating online in-auction fraud: Clues, techniques and

challenges." Computer Science Review 3, no. 4 (2009): 245-258.

[7] D. Chau, S. Pandit, and C. Faloutsos, "Detecting fraudulent personalities in networks of

online auctioneers." In Knowledge Discovery in Databases: PKDD 2006, pp.103-114. Springer

Berlin Heidelberg, 2006.

[8] D. Chau, and C. Faloutsos, "Fraud detection in electronic auction." In European Web

Mining Forum at ECML/PKDD, pp. 87-97. 2005.

[9] N. Hu, L. Liu, and V. Sambamurthy, "Fraud detection in online consumer reviews." Decision

Support Systems 50, no. 3 (2011): 614-626.

[10] S. B. Caudill, M. Ayuso, and M. Guillén, "Fraud detection using a multinomial logit model

with missing information." Journal of Risk and Insurance 72, no. 4 (2005): 539-550.

[11] A. Mukherjee, V. Venkataraman, B. Liu, and N. Glance, Fake review detection:

Classification and analysis of real and pseudo reviews. UIC-CS-03-2013. Technical Report,

2013.

[12] J. Hayhurst, and D. P. Mundy, "Using Location as a Fraud Indicator for eCommerce

Transactions."

[13] J. Deng, L. Fu, and Y. Yi, "ZLOC: Detection of Zombie Users in Online Social Networks

Using Location Information."

[14] X. Wu, Z. Feng, W. Fan, J. Gao, and Y. Yu, "Detecting marionette microblog users for

improved information credibility." In Machine Learning and Knowledge Discovery in Databases,

pp. 483-498. Springer Berlin Heidelberg, 2013.

[15] J. Deng, X. Gao, C. Wang, “Using Bi-level Penalized Logistic Classifier to Detect Zombie

Accounts in Online Social Networks,” University of North Carolina at Greensboro, Jilin

University, 2015.

[16] K. Lee, J. Caverlee, and S. Webb, "Uncovering social spammers: social honeypots+

machine learning." In Proceedings of the 33rd international ACM SIGIR conference on

Research and development in information retrieval, pp. 435-442. ACM, 2010.

[17] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, "Detecting spammers on twitter."

In Collaboration, electronic messaging, anti-abuse and spam conference (CEAS), vol. 6, p. 12.

2010.

[18] M. Mccord, and M. Chua, "Spam detection on twitter using traditional classifiers."

In Autonomic and trusted computing, pp. 175-186. Springer Berlin Heidelberg, 2011.

[19] A. Wang, "Don't follow me: Spam detection in twitter," In Security and Cryptography

(SECRYPT), Proceedings of the 2010 International Conference on, pp. 1-10. IEEE, 2010.

[20] X. Jin, C. Lin, J. Luo, and J. Han, "A data mining-based spam detection system for social

media networks." Proceedings of the VLDB Endowment 4, no. 12 (2011): 1458-1461.

[21] V. Bewick, L. Cheek, and J. Ball, "Statistics review 14: Logistic regression." Crit Care 9, no.

1 (2005): 112-118.

[22] J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques. Elsevier, 2011. pp.

543-576.

[23] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local

outliers." In ACM sigmod record, vol. 29, no. 2, pp. 93-104. ACM, 2000.

[24] L. Duan, L. Xu, Y. Liu, and J. Lee, "Cluster-based outlier detection." Annals of Operations

Research 168, no. 1 (2009): 151-168.

[25] V. Chandola, A. Banerjee, and V. Kumar. "Outlier detection: A survey." ACM Computing

Surveys (2007).

[26] A. L. Chiu, and A. W. Fu. "Enhancements on local outlier detection." In Database

Engineering and Applications Symposium, 2003. Proceedings. Seventh International, pp.

298-307. IEEE, 2003.

[27] C. Hsu, and Y. Huang. "Incremental clustering of mixed data based on distance hierarchy."

Expert Systems with Applications 35, no. 3 (2008): 1177-1185.

[28] D. R. Wilson, and T. R. Martinez. "Improved heterogeneous distance functions." Journal of

artificial intelligence research (1997): 1-34.

[29] C. Stanfill, and D. Waltz. "Toward memory-based reasoning." Communications of the ACM

29, no. 12 (1986): 1213-1228.