

Thirty Fifth International Conference on Information Systems, Auckland 2014

Adaptive Big Data Analytics for Deceptive Review Detection in Online Social Media

Completed Research Paper

Wenping Zhang Department of Information Systems

City University of Hong Kong Tat Chee Avenue, Kowloon,

Hong Kong SAR [email protected]

Raymond Y. K. Lau Department of Information Systems

City University of Hong Kong Tat Chee Avenue, Kowloon,

Hong Kong SAR [email protected]

Chunping Li

School of Software Tsinghua University

Beijing, China [email protected]

Abstract

The explosive growth of user-contributed reviews at e-Commerce and online social network sites calls for the design of novel big data analytics frameworks to cope with this challenge. The main contributions of our research are twofold. First, we design a novel big data analytics framework that leverages distributed computing and streaming to efficiently process big social media data streams. Second, we apply the proposed framework, which is underpinned by a novel parallel co-evolution genetic algorithm, to adaptively detect deceptive reviews with respect to different social media contexts. Our experiments show that the proposed big data analytics framework can effectively and efficiently detect deceptive reviews in a big social media data stream, and that it outperforms non-distributed big data analytics solutions. To the best of our knowledge, this is the first successful design of an adaptive big data analytics framework for deceptive review detection in a big data environment.

Keywords: Big Data Analytics, Streaming, Parallel Genetic Algorithm, Design Science

Introduction

With the rapid proliferation of the Social Web, user-contributed contents have become the norm. The amount of data produced by individuals, businesses, governments, and research agencies has been undergoing explosive growth, a phenomenon known as the data deluge. In online social networking, many sites now have between 100 and 500 million users. In electronic commerce, TaoBao, the largest e-Commerce site in the Greater China area, generated over 200 million transactions, reaching a peak rate of 205,000 transactions per minute, during its annual "Singles' Day" sales event on 11 November 2013. The streams of extremely large and evolving data from online social media, e-Commerce sites, search engines, and sensor networks call for the research and development of the next generation of methods and tools to meet the challenge of the data deluge. Big data is often characterized by three properties known as the 3 V's: Volume, Velocity, and Variety (Boden et al. 2013). Currently, there are two main paradigms for dealing with big data, namely distributed computing and streaming.


Decision Analytics, Big Data, and Visualization


Figure 1. Different Paradigms of Big Data Analytics

Most contents originally generated on the Social Web are streaming data. For example, the data representing actions and interactions among individuals in online social media, or the event-based data captured by sensor networks, are continuously evolving data streams. Other types of big data are arguably just snapshot views of streaming data. In a big data stream, data arrive at high speed, and data analytics algorithms must process them in one pass, which implies processing under very strict constraints of space and time. Figure 1 shows a broad classification of the different paradigms of big data analytics. Big data analytics can be classified into distributed and single-host approaches. Among single-host approaches, analytical frameworks such as the BID Data Suite support batch-mode processing of big data (Canny and Zhao 2013), whereas the MOA framework can analyze evolving big data streams (Bifet et al. 2011).

Among distributed methods, both Hadoop (http://hadoop.apache.org/), an open-source implementation of MapReduce, and Mahout (https://mahout.apache.org/) operate in batch mode. In contrast, S4 (http://incubator.apache.org/s4/) and Storm (http://storm.incubator.apache.org/) leverage distributed clusters of computing nodes to process big data streams. Although batch-mode big data analytics such as Hadoop is the mainstream to date, online incremental algorithms that can effectively process evolving data are desirable to address both the "volume" and the "velocity" dimensions of big data on the Web. MapReduce and stream-based big data analytics are two fundamentally different programming paradigms, though they are related from a theoretical perspective. Recently, there have been attempts to incorporate streaming and incremental computation on top of MapReduce, such as the Hadoop Online Prototype (https://code.google.com/p/hop/). We believe that combining both batch and streaming modes of analytics is essential to develop an operational big data analytics framework that supports real-world applications.
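The one-pass constraint of stream processing discussed above can be illustrated with a minimal sketch (not part of any framework cited here): an incremental aggregate that updates per-key statistics in constant memory as each record arrives, rather than materializing the whole data set as a batch job would.

```python
# Minimal illustration of the one-pass constraint on streaming data:
# an incremental (online) mean that uses O(1) memory per key.
# All names here are illustrative, not part of any framework cited above.

from collections import defaultdict

class StreamingMean:
    """Running mean per key; each record is seen exactly once."""
    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def update(self, key, value):
        self.count[key] += 1
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        self.mean[key] += (value - self.mean[key]) / self.count[key]

    def get(self, key):
        return self.mean[key]

stats = StreamingMean()
for rating in [5, 1, 1, 4]:          # e.g. star ratings arriving one by one
    stats.update("product-42", rating)
print(stats.get("product-42"))       # 2.75
```

Each update touches only the running statistics for one key, so space stays bounded no matter how long the stream runs, which is precisely what batch-mode MapReduce jobs do not guarantee.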

While big data analytics has been receiving a lot of attention from researchers and practitioners in recent years, few studies have examined the unintended consequences of big data (Wigan and Clarke 2013). For example, while there is an explosive growth of user-contributed contents such as product reviews posted to various social media (e.g., Twitter, Facebook, Epinions, Sina Weibo) and e-Commerce (e.g., Amazon, Resellerratings, Taobao) sites every day, effective methods of detecting and filtering low-quality contents from the big social media data stream are yet to come. Liu et al. (2007) employ the criteria of informativeness, readability, and helpfulness to distinguish between low-quality and high-quality online contents. In this paper, low-quality contents refer to deceptive and non-informative contents that are created with the purpose of misleading readers. Figure 2 is an example of product related comments (e.g., Apple iPhone versus Samsung Galaxy) posted to online social media such as Twitter. These massively available online comments provide organizations with unprecedented opportunities to extract valuable business intelligence to strengthen the development of effective business strategies (Archak et al. 2011; Bollen and Mao 2011).

Figure 2. Products Related Comments at Twitter

Nevertheless, the widespread sharing and utilization of uncontrolled user-contributed reviews in social network and e-Commerce sites has also raised serious concerns about the credibility and truthfulness of these contents (Castillo et al. 2011; Dellarocas 2006; Lau et al. 2011; Ratkiewicz et al. 2011). On 23 September 2013, the Guardian reported that the attorney general of New York set up a fake yoghurt shop in Brooklyn to ensnare fake online review companies that posted deceptive reviews to Google, Yahoo, Yelp, and other online social media; 19 companies were eventually caught and fined a total of $350,000. As a matter of fact, organizations and individuals have financial motives to strategically manipulate product related comments posted to online social media (Benevenuto et al. 2009; Dellarocas 2006; Hu et al. 2012). Accordingly, there is a pressing need to develop an effective and efficient methodology for detecting low-quality reviews in online social media, so that organizations and individuals are better protected from the so-called digital information war (Valle 2007). Nevertheless, the characteristics of big social media data call for the design of novel analytics frameworks to address these new challenges.

Ongoing efforts to design deceptive review detection methods have been reported in the literature. Common approaches include near-duplicate detection methods (Jindal and Liu 2008; Lau et al. 2011; Sun et al. 2013), psycholinguistic detection methods (Ott et al. 2011), spammer behavioral detection methods (Mukherjee et al. 2013; Xie et al. 2012), and spammer-review graph analysis (Wang et al. 2012). However, all these methods were examined in e-Commerce settings only. On the other hand, though there have been studies on assessing the credibility of newsworthy topics or political comments in online social media (Castillo et al. 2011; Ratkiewicz et al. 2011), few studies have examined the problem of detecting deceptive product related comments in online social media. Furthermore, existing methods often adopt an ad-hoc approach of manually selecting a subset of classification features that may lead to good detection performance in a particular setting. There is no principled and systematic way to search for optimal or near-optimal features with respect to a variety of detection contexts. After all, since spammers tend to modify their deceptive tactics (Lau et al. 2011), it is desirable to design an adaptive detection method that can continuously refine its feature set and system parameters based on users' feedback. Unfortunately, research on adaptive methods for detecting deceptive reviews is not found in the existing literature. Above all, none of the existing work has examined big data analytics for the detection of deceptive reviews fed from big social media data streams. Our research aims to fill the aforementioned research gaps.

The main contributions and novelties of the research reported in this paper are as follows. First, a novel big data analytics framework that combines distributed computing and streaming to analyze big social media data streams is developed. Second, we design a novel methodology that detects deceptive reviews among the sheer volume of messages posted to both e-Commerce and social media sites by leveraging the proposed big data analytics framework. Third, we design an adaptive big data analytics framework that is underpinned by a novel parallel co-evolution genetic algorithm (PCGA) to systematically search for near-optimal feature sets that enhance detection performance. Since perpetrators constantly revise their deceptive strategies, it is highly desirable to design an adaptive big data analytics framework that can continuously learn and adapt to changing detection contexts. To the best of our knowledge, this is the first successful design of an adaptive big data analytics framework to deal with an unintended consequence of big social media data, that is, the increasing volume of low-quality contents in online social media.

From a theoretical standpoint, our research contributes to advance the knowledge and techniques of big data analytics, particularly adaptive big data analytics. The practical implication of our research is that organizations can apply our design artifacts (Hevner et al. 2004) to detect and filter deceptive comments from the big social media data stream. As a result, the effectiveness of business intelligence extraction from online social media is enhanced. The direct business implication is that more accurate and timely business intelligence is made available to enhance organizations’ business strategy development. The societal implication of our research is that individuals can leverage our design artifacts to identify deceptive reviews in online social media. As a result, they can make more informed decisions regarding online shopping, net-enabled financial investment, or other daily activities. Accordingly, consumer welfare is better protected in our society.

The main research questions of our study reported in this paper are summarized as follows:

Is the proposed big data analytics framework effective in detecting deceptive product comments in different online social media settings?

Can the proposed PCGA enhance detection performance by searching for near-optimal detection features and system parameters?

Is the proposed parallel and distributed big data analytics framework more efficient than a single-host big data analytics approach?

The rest of the paper is organized as follows. The next section highlights the adaptive big data analytics framework for deceptive review detection. Then, the computational methods of adaptively detecting deceptive reviews under a parallel and distributed framework are illustrated, followed by a discussion of our system evaluation and experimental results. Finally, we offer concluding remarks and discuss future directions of our research work.

Related Research

Boden et al. (2013) proposed a parallel data processing framework for large-scale social media analytics, though the computational details and the evaluation of their framework were not reported. Tan et al. (2013) provided a conceptual exploration of social-network-sourced big data analytics, although the implementation of their conceptual ideas was not discussed. Diaz-Aviles (2013) discussed conceptual ideas about scaling up recommender systems for big social media analytics. Moreover, Lin and Ryaboy (2013) shared their real-world experience of making big data analytics operational at Twitter. Kumar et al. (2013) exploited extensions of relational database technologies to support big data analytics. Zeng and Lusch (2013) likewise provided a conceptual analysis and discussion of an ecosystem-based rather than transaction-based view of big data analytics; neither the computational details nor the implementation issues were examined in their preliminary study (Zeng and Lusch 2013). Canny and Zhao (2013) proposed the BID big data analytics suite, which includes a collection of hardware, software, and design patterns to enable large-scale data mining at low cost. In particular, the BID suite consists of the BID data engine, the BID matrix library that integrates the CPU and the Graphics Processing Unit (GPU) to accelerate typical data mining tasks, the BID machine learning library that incorporates efficient model optimizers, the butterfly communication model, and design patterns for enhancing iterative update algorithms (Canny and Zhao 2013). Tan et al. (2012) examined the storage management issue rather than the analytical functionality of big data, and developed the CIoST distributed storage system for the storage and retrieval of big spatio-temporal data. As a whole, most of the big data analytics approaches discussed in the literature remain at the preliminary and conceptual analysis stage. Our research differs from these previous studies in that we not only address the conceptual issues of big data analytics, but also design a novel big data analytics framework and apply it to solve a critical business problem, namely the increasing number of deceptive product comments posted to e-Commerce and social media sites.

Mukherjee et al. (2013) developed a hierarchical Bayesian generative model for the unsupervised learning and detection of deceptive reviews at e-Commerce sites. The proposed Author Spamicity Model (ASM) captured the behavioral footprints (e.g., when a review was written) of opinion spammers, and model learning was conducted via approximate posterior inference such as Monte Carlo Gibbs sampling. Morales et al. (2013) examined the issue of machine-generated deceptive reviews: they articulated a deceptive review synthesization method and proposed a coherence-based method to detect such machine-synthesized reviews. Xie et al. (2012) employed multi-scale multidimensional time series to detect anomalies of deceptive review generation over a pre-defined time window (e.g., months). Ott et al. (2012) leveraged a hierarchical Bayesian generative model named the Bayesian Prevalence Model (BPM) to estimate the prevalence of deceptive reviews at e-Commerce sites without actually retrieving a big review data set. Hu et al. (2012) examined the readability and sentiment of texts to detect deceptive reviews. In a similar context, Lau et al. (2010, 2011) employed probabilistic language modeling approaches to detect strategically manipulated online product reviews. Mukherjee et al. (2012) examined the spammer group detection problem by exploiting the inter-relationships among products, reviewer groups, and individual reviewers.

Wang et al. (2012) proposed the social review graph to capture the interactions among online stores, reviewers, and reviews, and hence to identify review spammers. By examining the psycholinguistic differences between legitimate reviews and deceptive reviews, Ott et al. (2011) developed a psycholinguistically motivated approach to detecting deceptive online reviews. Chen and Tseng (2011) employed an information quality (IQ) framework to guide the extraction of features for the prediction of online review quality. Castillo et al. (2011) examined the credibility of postings and re-postings on Twitter. Ratkiewicz et al. (2011) leveraged supervised machine learning methods to develop a real-time service that detects "truthy" political memes in the Twitter stream. Benevenuto et al. (2009) examined the problem of automated detection of spammers and promoters in online video social networks. Benevenuto et al. (2010) applied supervised machine learning methods to identify spammers as well as spam contents on Twitter. Grier et al. (2010) performed an empirical study of commercial spam on Twitter, and found that spammers on Twitter actually colluded to create spam campaigns. In summary, previous work focuses only on detecting deceptive reviews in e-Commerce settings, while existing social media studies assess the credibility of newsworthy topics or detect commercial spam. Our work fills this research gap by designing an adaptive methodology that can detect deceptive reviews in both e-Commerce and online social media settings. Moreover, we address the issue of systematically searching for near-optimal features for detecting deceptive reviews in different contexts. Above all, we explore a big data analytics framework for deceptive review detection, an approach that has not been discussed in the existing literature.

The Adaptive Big Data Analytics Framework

An overview of the proposed framework for Adaptive BIG Data Analytics for Deceptive review detection (ABIGDAD) is depicted in Figure 3. The ABIGDAD framework consists of seven main components, which are implemented using well-proven methods and tools made available under open-source license agreements. For instance, the open-source distributed data stream engine (DDSE) for big data, Apache Storm (http://storm.incubator.apache.org/), is adopted to develop our data stream processor, which handles the stream of messages retrieved by the dedicated APIs and our crawlers. In addition, a Twitter firehose and open-source Twitter APIs such as Topsy (http://topsy.com/) are invoked to collect a large sample of Twitter messages and public user profiles during a test period. The main components of ABIGDAD are described as follows:

Distributed File Storage System: the ABIGDAD framework leverages the open-source Apache HBase and the Hadoop Distributed File System (HDFS) for the storage and retrieval of the big data collected by the dedicated APIs and crawlers. The big data of user-contributed comments are stored across a cluster of distributed commodity computers.


Data Stream Processor: the data stream processor of ABIGDAD consists of three main elements: a message filter, the open-source distributed data stream engine Storm, and a collection of APIs and crawlers. The DDSE can process a big stream of incoming messages in parallel by means of a cluster of distributed computing nodes. Our current implementation supports commercial APIs such as the Datasift Twitter firehose, open-source Twitter APIs such as the Topsy API, and an array of crawlers such as the Sina Weibo crawler. The message filter is responsible for filtering out messages not related to products and services. Though APIs such as Topsy and Datasift have built-in message filters, our own filter is still necessary to enhance the precision of product related message filtering. Filtered relevant messages are then stored in our distributed file system via HBase and HDFS.
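As a rough illustration of the message filter's role (the actual ABIGDAD filter is not specified in detail here), the following sketch keeps only messages that mention a term from a hypothetical product lexicon; the lexicon and sample messages are illustrative assumptions.

```python
# A minimal sketch of a product-relevance message filter, assuming a
# simple keyword-based test. The lexicon below is hypothetical, not the
# one used by the actual ABIGDAD implementation.

PRODUCT_TERMS = {"iphone", "galaxy", "camera", "laptop"}  # hypothetical lexicon

def is_product_related(message: str) -> bool:
    """Keep only messages that mention at least one product term."""
    tokens = {t.strip("#@.,!?").lower() for t in message.split()}
    return not tokens.isdisjoint(PRODUCT_TERMS)

stream = [
    "Just got the new #iPhone, battery life is amazing!",
    "Lovely weather in Auckland today.",
]
relevant = [m for m in stream if is_product_related(m)]
# Only the first message survives the filter.
```

In a Storm topology, logic of this kind would typically live in a filtering bolt so that irrelevant messages are discarded before they reach the storage layer.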

Figure 3. The ABIGDAD Framework

Data Pre-processor: the data pre-processor invokes the Stanford Dependency Parser (http://nlp.stanford.edu/software/lex-parser.shtml) and the GATE Named Entity Recognition (NER) module (Maynard et al. 2001) to extract syntactical, lexical, and stylistic features for subsequent deceptive review detection. Our current implementation includes a Chinese-English machine translator and a corresponding Chinese word segmentation and part-of-speech (POS) tagging module named ICTCLAS (http://ictclas.org/). Since this paper focuses on English text processing only, we will not further discuss the bi-lingual message processing capability of the proposed framework. During text pre-processing, a text cleaner is invoked to further tidy up incoming messages, so that markup and snippets are removed.
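A minimal sketch of the text cleaner described above, assuming that "markup and snippets" covers HTML tags, embedded URLs, and excess whitespace; the actual cleaning rules of the system may differ.

```python
# A minimal text cleaner sketch. The specific cleaning rules (strip HTML
# tags, strip URLs, collapse whitespace) are illustrative assumptions.

import re

def clean_message(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML/markup tags
    text = re.sub(r"https?://\S+", " ", text)     # strip embedded URLs
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    return text

raw = "<b>Great phone!</b> See https://example.com for details"
print(clean_message(raw))   # Great phone! See for details
```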

Feature Miner: the feature miner consists of a content-based miner, a behavior-based miner, and a message diffusion network-based miner, which adaptively extract different kinds of features from filtered messages for subsequent deceptive review detection. For content-based feature mining, the WordNet-Affect lexicon (Valitutti et al. 2004) and SentiWordNet (Esuli and Sebastiani 2005) are applied to mine affect from messages. The computational details of these feature mining processes are described in the following section. Near-duplicate review detection is also a content-based detection approach. However, since the computational methods of near-duplicate review detection have been examined by previous research (Jindal and Liu 2008; Lau et al. 2011; Mukherjee et al. 2013), we do not discuss their details in this paper. Our framework can utilize either the standard cosine similarity metric or a probabilistic language modeling based method for near-duplicate deceptive review detection (Lau et al. 2011). Though near-duplicate detection methods are quite effective at detecting deceptive reviews, their general computational complexity is O(n²), where n is the number of messages. None of the previous research has discussed how to scale up near-duplicate review detection to Web scale (Jindal and Liu 2008; Lau et al. 2011; Mukherjee et al. 2013). One novel contribution of our work is that we leverage the ABIGDAD framework to scale up near-duplicate deceptive review detection by means of parallel and distributed processing. In particular, a heuristic approach is employed to divide messages into subsets according to the common product catalog defined by Amazon. Parallel and distributed near-duplicate detection is then performed within each subset to improve overall efficiency.
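The divide-by-catalog heuristic can be sketched as follows; the category labels, tokenization, and similarity threshold are illustrative assumptions, and in the actual framework each partition would be compared on a separate computing node rather than in a single loop.

```python
# A sketch of the partition-then-compare heuristic: messages are first
# grouped by product category, and pairwise cosine similarity is computed
# only within each (much smaller) partition. Category labels and the
# duplicate threshold below are illustrative assumptions.

import math
from collections import Counter, defaultdict
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def near_duplicates(messages, threshold=0.9):
    """messages: list of (category, text); returns index pairs of near-duplicates."""
    by_cat = defaultdict(list)
    for i, (cat, text) in enumerate(messages):
        by_cat[cat].append((i, Counter(text.lower().split())))
    pairs = []
    for bucket in by_cat.values():        # each bucket could run on its own node
        for (i, va), (j, vb) in combinations(bucket, 2):
            if cosine(va, vb) >= threshold:
                pairs.append((i, j))
    return pairs

msgs = [
    ("phones", "best phone ever buy it now"),
    ("phones", "best phone ever buy it now"),
    ("laptops", "battery drains fast avoid this laptop"),
]
print(near_duplicates(msgs))   # [(0, 1)]
```

Because the O(n²) comparison runs only within each partition, total work drops roughly by the square of the number of partitions, and the buckets are independent, so they parallelize naturally.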

Ensemble of Classifiers: the ABIGDAD framework utilizes a classification ensemble to improve prediction accuracy. We use a dozen classifiers: a naive Bayes classifier; four SVM classifiers with linear, polynomial, RBF, and hyperbolic tangent kernels from the LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/); a conditional random field (CRF) classifier (Sarawagi 2006); two decision tree classifiers, namely C4.5 and the J48 decision tree (http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html); a logistic regression classifier; a two-layer feedforward perceptron neural network from the Weka package; an AdaBoost classifier; and a Hidden Markov Model (HMM) classifier. An active learner (Xu et al. 2014) is coupled with each classifier so that the most informative training set is selected from the big data environment to effectively train the classifier. Due to limited space, this paper does not discuss the active learning aspect of our framework. Finally, a weighted voting scheme is applied to generate the final prediction based on the classification results derived by the individual classifiers (Hullermeier and Vanderlooy 2010). One novelty of our classification ensemble method is that the weight of each classifier is continuously tuned by the proposed parallel co-evolution genetic algorithm.
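A weighted voting scheme of this kind can be sketched as follows; the classifier names and weights are illustrative, and in the actual framework the weights would be tuned by the PCGA rather than fixed by hand.

```python
# A minimal weighted-voting sketch for a classification ensemble.
# Classifier names and weights below are illustrative assumptions.

def weighted_vote(predictions, weights):
    """predictions: {classifier: label}; weights: {classifier: float}."""
    scores = {}
    for clf, label in predictions.items():
        scores[label] = scores.get(label, 0.0) + weights.get(clf, 0.0)
    return max(scores, key=scores.get)

preds = {"naive_bayes": "deceptive", "svm_rbf": "genuine", "crf": "deceptive"}
w = {"naive_bayes": 0.5, "svm_rbf": 0.9, "crf": 0.6}
print(weighted_vote(preds, w))   # deceptive  (0.5 + 0.6 = 1.1 > 0.9)
```

Tuning the weight vector amounts to searching a continuous parameter space, which is exactly the kind of search the genetic algorithm in the next component performs.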

Learning and Adaptation: the learning and adaptation component utilizes a novel parallel co-evolution genetic algorithm to adaptively select near-optimal subsets of features for the deceptive review detection performed by the prediction layer. In addition, the PCGA searches for near-optimal system parameters based on occasional user feedback routed via the communication manager. One unique characteristic of the proposed big data analytics framework is that sub-populations of the PCGA are co-evolved in parallel via a cluster of computing nodes to improve computational efficiency. Given a large feature set and a large number of system parameters, co-evolution based on parallel and distributed computing is essential, as it scales up evolutionary learning in a big data environment. The computational details of our parallel co-evolution genetic algorithm are illustrated in the next section.
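To convey the flavour of genetic feature selection, the following is a deliberately simplified, single-population sketch; the actual PCGA co-evolves several sub-populations in parallel across computing nodes, and the toy fitness function below merely stands in for classifier accuracy on a validation set.

```python
# A highly simplified, single-population sketch of genetic feature
# selection with bit-string chromosomes. The fitness function is a toy
# stand-in for classifier accuracy; everything here is illustrative.

import random

random.seed(0)
N_FEATURES = 12

def fitness(chromosome):
    # Toy objective: reward "informative" features (even indices here),
    # penalize subset size to favour compact feature subsets.
    gain = sum(1 for i, bit in enumerate(chromosome) if bit and i % 2 == 0)
    return gain - 0.1 * sum(chromosome)

def evolve(pop_size=20, generations=30, mutation_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]             # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_FEATURES)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability mutation_rate.
            child = [bit ^ (random.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(best)   # a near-optimal feature mask for the toy objective
```

In a co-evolutionary variant, several such populations would evolve independently on different nodes and periodically exchange their best chromosomes, which is what makes the search scale in a distributed setting.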

Communication Manager: the communication manager consists of four sub-components, namely a detection dashboard, a mobile presenter, a query processor, and a user feedback collector. The detection dashboard conveys detection results to users via texts, tables, graphs, and audio outputs. Moreover, the mobile presenter can adjust presentation formats and layouts so that detection results are displayed properly on mobile devices. The query processor accepts users' queries about certain products or product categories for deceptive review detection, whereas the feedback collector takes both explicit and implicit feedback from users and routes it to the learning and adaptation component for continuous fine-tuning of the various components of the framework.

Although several big data analytics frameworks have been proposed to process big data streams fed from online social media (Boden et al. 2013; Canny and Zhao 2013; Tan et al. 2013; Zeng and Lusch 2013), none of these frameworks is equipped with a learning and adaptation component that can bootstrap the performance of a big data analytics system. Since the proposed framework is designed to detect deceptive product comments in online social media, and perpetrators tend to revise their deceptive strategies, it is essential to develop an adaptive big data analytics framework that can continuously learn and adapt to changing detection contexts. This is indeed one of the novelties of the proposed ABIGDAD framework.


The Computational Methods

This section illustrates the computational methods of extracting content-based, behavior-based, and diffusion network-based features for the detection of deceptive reviews. Since some features are setting-specific (e.g., star rating of a review for the e-Commerce setting) or even site-specific (e.g., hashtags for Twitter), only the relevant features pertaining to a message are utilized by our classifiers and the PCGA. The classification ensemble based prediction module and the PCGA will be explained in the following sections.

Content-based Features

According to previous studies (Mukherjee et al. 2013; Ott et al. 2011; Piskorski et al. 2008), we identify a candidate set of features for deceptive review detection under different settings. Previous research reported content-based features that worked well under a specific setting (e.g., the Amazon site). However, there is no guarantee that such a feature subset is near optimal, or that it works well under another setting (e.g., Twitter). Our research fills this gap by using the proposed PCGA to search for a near optimal subset of features for a specific detection setting. When a new setting is encountered, we can systematically apply the proposed PCGA, instead of an ad-hoc manual tuning method (Benevenuto et al. 2010; Castillo et al. 2011; Jindal and Liu 2008; Mukherjee et al. 2013; Ott et al. 2011; Ratkiewicz et al. 2011), to identify a set of near optimal features. Based on the Corleone (CL) and the General Inquirer (GI) linguistic data processing packages (http://www.wjh.harvard.edu/~inquirer/), which were previously applied to Web spam detection (Piskorski et al. 2008), we identify 23 candidate features from CL and 185 candidate features from GI. In addition, 80 psycholinguistic features that were originally defined in the Linguistic Inquiry and Word Count (LIWC) lexicon and successfully applied to deceptive review detection (Ott et al. 2011) are also included in our study. Three readability measures, namely the Gunning-Fog index, the Flesch-Kincaid measure, and the SMOG measure, are also included as candidate features. Moreover, 173 common emoticons defined by the NetLingo package (http://www.netlingo.com/smileys.php), the number of URLs contained in the message, the number of pointers to multimedia contents, and the length of the message are included in our candidate feature set.
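As an illustration of a readability feature, the Gunning-Fog index mentioned above is computed as 0.4 × (average sentence length + percentage of complex words), where a complex word has three or more syllables. The sketch below is a minimal implementation; the vowel-group syllable counter is a rough heuristic, not the exact tokenizer used in our pipeline.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (incl. y)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def gunning_fog(text: str) -> float:
    """Gunning-Fog index: 0.4 * (avg sentence length + % complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100.0 * len(complex_words) / len(words))
```

A higher score indicates text that requires more years of education to read on first pass; deceptive reviews sometimes deviate markedly from a product's typical review readability.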

Behavior-based Features

Behavior-based features refer to both a writer's characteristics and how s/he produces product related comments. We employ 23 behavior-based features in our study. These candidate features include: (1) writer registration age; (2) writer logo; (3) writer description; (4) gender of writer; (5) registration location of writer; (6) URL of writer; (7) writer status; (8) writer verification; (9) no. of friends of writer; (10) no. of followers of writer; (11) no. of comments produced by writer; (12) no. of comments on the same product produced by writer; (13) no. of replied messages; (14) no. of forwarded messages; (15) the time of day when the comment is written; (16) the day of the week the comment is produced; (17) whether it is the first comment for a product; (18) the helpful votes of the comment; (19) the client system that generates the comment; (20) rating inconsistency; (21) rating deviation for the product; (22) sentiment inconsistency; (23) sentiment deviation for the product. Candidate features (20) and (21) are setting-specific and mostly apply to the product reviews of e-Commerce sites rather than the product related comments posted to online social media.
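Candidate features (20) and (21) are not given closed forms above; the sketch below illustrates one plausible reading, under the assumptions that rating deviation is the absolute distance of a review's star rating from the product's mean rating, and rating inconsistency flags disagreement between the star rating and the review text's sentiment polarity.

```python
from statistics import mean

def rating_deviation(review_rating: float, product_ratings: list) -> float:
    """One plausible form of feature (21): distance of this review's star
    rating from the product's mean rating."""
    return abs(review_rating - mean(product_ratings))

def rating_inconsistency(review_rating: float, sentiment: float,
                         max_rating: float = 5.0) -> bool:
    """One plausible form of feature (20): a star rating that disagrees
    with the text's sentiment polarity, e.g., 5 stars with negative text.
    `sentiment` is assumed to lie in [-1, 1]."""
    midpoint = max_rating / 2.0
    return (review_rating > midpoint and sentiment < 0) or \
           (review_rating < midpoint and sentiment > 0)
```

For example, a 5-star review on a product averaging 4 stars has a deviation of 1.0, and a 5-star review with strongly negative text would be flagged as inconsistent.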

Diffusion Network-based Features

Diffusion network-based features refer to how a product related comment is propagated in online social media. This set of features is applicable to the setting of online social media, or an e-Commerce site that supports social networking functions, such as Epinions. According to a previous study (Castillo et al. 2011), basic diffusion network-based features include: (1) the connection degree of the node (user) who first propagates the comment; (2) the total number of messages in the largest sub-graph under the root node; (3) the depth of the comment propagation network; (4) the average degree of a node in the propagation network; and (5) the maximal degree of a node in the propagation network. In this paper, we also propose a novel diffusion network-based feature that is based on the concept of proximity prestige in social network analysis (SNA) (Wasserman and Faust 1999). The basic intuition is that if a product related comment is propagated from a relatively new user to many other weakly connected users (e.g., they are strangers to each other) in a social network, it is more likely that the comment is spam. In contrast, if the comment is diffused from a long-time registered network user to other strongly connected users (e.g., they are close friends) in the network, the chance that the user is spreading spam is smaller. Our proposed average diffusion score AvgDiff(G) of a review diffusion network G is defined as follows.

Definition 1 (Review Diffusion Network). A review diffusion network is a weighted directed graph G = (V, E) that comprises a finite set of nodes V and edges E. Each node v_i ∈ V represents a user x_i of a social network, while an edge e_ij ∈ E is an ordered pair of nodes (v_i, v_j). In particular, e_ij indicates that user x_i refers a product review to another user x_j through a social connection. The weight of an edge, w(e_ij) ∈ (0, 1], represents the strength of the social connection between x_i and x_j, estimated based on the frequency of their communications in the network.

Definition 2 (Review Propagation Path). Let P_ij be the set of all possible review propagation paths from a source node v_i to a sink node v_j. A review propagation path p_ij ∈ P_ij is a directed acyclic path comprising a sequence of pairs of nodes (v_i, v_1), (v_1, v_2), ..., (v_k, v_j) satisfying two conditions: (1) the source node v_i and the sink node v_j appear only once in the sequence; (2) every intermediate node v_k ∈ p_ij (v_k ≠ v_i, v_k ≠ v_j) appears exactly twice, first as the ending node of a pair and then as the starting node of the immediately following pair. The length of a review propagation path, len(p_ij) = |p_ij|, is defined as the number of pairs of nodes along the path.

Definition 3 (Shortest Review Propagation Path). The shortest review propagation path from a source node v_i to a sink node v_j, denoted p_ij^short ∈ P_ij, holds the property: ∀ p_ij ∈ P_ij, len(p_ij^short) ≤ len(p_ij).

Definition 4 (Review Diffusion Reachability). The review diffusion reachability of a node v_i, denoted reach(v_i), is defined as the set of unique nodes along all review propagation paths originating from v_i, that is, reach(v_i) = { v_j ∈ V : (v_j ≠ v_i) ∧ (len(p_ij) > 0) }.

UserPres(v_i) = [ (|reach(v_i)| / (|V| − 1)) · Σ_{v_j ∈ reach(v_i)} Σ_{e ∈ p_ij^short} w(e) ] / [ Σ_{v_j ∈ reach(v_i)} len(p_ij^short) / |reach(v_i)| ]   (1)

AvgDiff(G) = Σ_{v_i ∈ V} UserPres(v_i) / |V|   (2)

It is important to measure a node's review propagation paths and the shortest propagation path in the diffusion network according to Definitions 2 and 3, because a user may route a review in many different ways. However, it is the shortest propagation paths that reflect a user's direct accessibility to other users in the network. The basic intuition is that an ordinary user tends to have many direct connections to her friends, whereas a spammer may not have such connection prestige. The computational complexity of Eq. 2 is O(|V|²). The proposed big data analytics framework can significantly improve computational efficiency by dividing the whole diffusion network G into sub-graphs and then estimating the average diffusion score on each sub-graph in parallel.
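A minimal, single-machine sketch of computing AvgDiff(G) per the definitions above is shown below. The adjacency-dict graph representation and the BFS that tracks one shortest path's accumulated edge weight are illustrative simplifications, not the parallel sub-graph implementation described in the text.

```python
from collections import deque

def shortest_paths(graph, src):
    """BFS over a weighted directed graph {node: {succ: weight}}.
    Returns {node: (hop_length, weight_sum_along_one_shortest_path)}."""
    found = {src: (0, 0.0)}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        d, w = found[u]
        for v, wt in graph.get(u, {}).items():
            if v not in found:
                found[v] = (d + 1, w + wt)
                queue.append(v)
    found.pop(src)                      # exclude the source itself
    return found

def user_prestige(graph, src, n_nodes):
    """UserPres(v_i): weighted reach normalized by |V|-1, divided by the
    average shortest-path length to the reachable nodes."""
    paths = shortest_paths(graph, src)
    if not paths:
        return 0.0
    reach = len(paths)
    weight_sum = sum(w for _, w in paths.values())
    avg_len = sum(d for d, _ in paths.values()) / reach
    return (weight_sum * reach / (n_nodes - 1)) / avg_len

def avg_diff(graph):
    """AvgDiff(G): mean UserPres over all nodes of the network."""
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    return sum(user_prestige(graph, v, len(nodes)) for v in nodes) / len(nodes)
```

For a tiny chain a → b → c with edge weights 1.0 and 0.5, node a reaches two nodes while c reaches none, so a receives the highest prestige.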


Classification Ensemble

A classification ensemble that consists of a dozen state-of-the-art classifiers is applied to improve detection accuracy. These classifiers are invoked in parallel via the distributed nodes of a cluster. A weighted voting scheme is used to make the final prediction based on the classification results produced by the individual classifiers (Hullermeier and Vanderlooy 2010). One novelty of our classification ensemble method is that the weight of each classifier is continuously tuned by the proposed parallel co-evolution genetic algorithm. For a binary prediction, the ensemble determines the final deception detection label according to the following formula:

DeScore = 1 if Σ_{c ∈ Pos} w(c) · vote(c) > Σ_{c ∈ Neg} w(c) · vote(c); 0 otherwise   (3)

where Pos and Neg are the respective sets of classifiers that make a positive or negative prediction, c denotes a classifier, vote(c) ∈ {0, 1} is the individual prediction of the classifier, and w(c) ∈ [0, 1] is the weight of the classifier. If DeScore = 1, the corresponding review is classified as a deceptive review.
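A minimal sketch of the weighted voting rule in Eq. (3), under the reading that the weighted mass of classifiers voting "deceptive" is compared against that of classifiers voting "legitimate" (the classifier names and weights below are hypothetical):

```python
def ensemble_label(votes, weights):
    """Binary ensemble decision: compare the weighted mass of positive
    voters (vote == 1, deceptive) against negative voters (vote == 0).
    `votes` maps classifier name -> 0/1; `weights` maps name -> [0, 1]."""
    pos = sum(weights[c] for c, v in votes.items() if v == 1)
    neg = sum(weights[c] for c, v in votes.items() if v == 0)
    return 1 if pos > neg else 0
```

Because the weights are tuned by the PCGA, a single highly weighted classifier can outvote several weaker ones.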

Since deceptive review detection involves uncertainty, a ranking-based detection result can also be presented to a user according to their preferences captured by the presentation layer. A ranking-based prediction is constructed according to the deception score defined by the following formula:

DeScore = Σ_{c ∈ Ensem} w(c) · score(c)   (4)

where Ensem is the set of classifiers of the classification ensemble, and score(c) ∈ [0, 1] is the normalized classification score produced by an individual classifier c.

The Parallel Co-evolution Genetic Algorithm

A novel parallel co-evolution genetic algorithm (PCGA) is designed so that a large search space can be divided into subspaces for a parallel and diversified search, which improves both the efficiency and the effectiveness of the heuristic search process. Each subspace (i.e., a sub-population) is managed by a separate computing node of a cluster. Three main elements are involved in the design of a genetic algorithm (GA): a fitness function, a chromosome encoding, and a procedure that drives the evolution process of chromosomes (Goldberg 1989; Lau et al. 2006; Roberge et al. 2014). First, the fitness function of our PCGA is developed based on a performance metric (i.e., the top-20 precision of the prediction). Second, since various components of the proposed framework should be continuously refined, multiple sub-populations of chromosomes are encoded and co-evolved simultaneously. Figure 4 shows a sketch of our chromosome encoding and co-evolution process. A binary encoding scheme is used for detection features, whereas a real-valued encoding scheme is applied to system parameters. During a migration cycle, the best chromosome of each sub-population (e.g., linguistic review features, weights of classifiers, social media sources, system parameters) is exchanged with those of the other sub-populations. Armed with all the information about features, classifiers, and system parameters, the combination of the best chromosomes from all sub-populations represents a feasible detection solution, and its fitness can be assessed according to the following equation:

fitness(com) = TruePositive_top20 / SysPositive_top20   (5)

where TruePositive_top20 and SysPositive_top20 represent the number of true positives and the number of system predictions at the top twenty positions, respectively. The rationale for using the top-twenty precision P@20 as the fitness function is that it drives the improvement of the performance of the whole system. A similar metric has been successfully applied to information retrieval benchmark tests (Ounis et al. 2008).
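The fitness function of Eq. (5) amounts to precision at the top twenty positions: of the twenty reviews the system ranks as most likely deceptive, the fraction that the user judges to be truly deceptive. A minimal sketch:

```python
def precision_at_k(predicted_ids, true_positive_ids, k=20):
    """P@k fitness per Eq. (5): the fraction of the top-k system
    predictions that were judged truly deceptive by the user."""
    top_k = predicted_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for r in top_k if r in true_positive_ids)
    return hits / len(top_k)
```

For instance, if the user confirms 2 of the system's top 4 predictions, the fitness is 0.5.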


Figure 4. Chromosome Encoding and the Parallel Co-evolution Process

Algorithm PCGA(Fit, Gen, POP, Psize[x], Pm[x], Pc[x], ER[x], MIG[x])
Inputs:
  Fit      /* the average fitness threshold for GA termination */
  Gen      /* the maximum no. of generations of an evolution cycle */
  POP      /* no. of sub-populations for co-evolution, e.g., 10 */
  Psize[x] /* the population size of a specific sub-population x */
  Pm[x]    /* the mutation probability of a specific sub-population x */
  Pc[x]    /* the crossover probability of a specific sub-population x */
  ER[x]    /* the elitism rate of a specific sub-population x */
  MIG[x]   /* the periodic migration threshold of a specific sub-population x */
Outputs:
  Ch[x]    /* the fittest chromosome from each sub-population */

Main Procedure:
 1. start parallel processing, for x = 1 to POP   /* executed on each distributed processor */
 2.   g = 1, exg = 1;   /* index of the first generation; index of the migration process */
 3.   randomly assign a feasible value to each gene of each chromosome of sub-population P[x]g;
 4.   compute the fitness of each chromosome in P[x]g for generation g based on the fitness function;
 5.   exchange the fittest chromosomes among the sub-populations;
 6.   while g <= Gen and Fitness(P[x]g) <= Fit[x]
 7.     select the ER[x] percentage of fittest chromosomes of P[x]g to build the next P[x]g+1;
 8.     while size(P[x]g+1) <= Psize[x]
 9.       apply roulette-wheel selection to select two chromosomes to produce P[x]g+1;
10.       generate a random number r for each selected pair of chromosomes;
11.       if r < Pc[x], apply uniform crossover to the selected chromosomes;
12.       generate a random number r for each evolvable gene of each chromosome;
13.       if r < Pm[x], apply mutation to the evolvable gene of each selected chromosome;
14.     end
15.     compute the fitness of each chromosome in P[x]g+1 for generation g+1;
16.     if exg > MIG[x]
17.       exchange the fittest chromosomes among the sub-populations;
18.       exg = 1;
19.     end
20.     g++, exg++;
21.   end
22.   return the fittest chromosome Ch[x] from P[x]g;
23. end parallel processing
24. exit

Figure 5. The Parallel Co-evolution Genetic Algorithm

The proposed fitness function makes a good trade-off between system effectiveness and the amount of human intervention. The learning and adaptation cycle is only invoked periodically, particularly when the detection performance of the system degrades or a new detection setting (e.g., an online social network) is applied. In either case, a user only needs to judge the correctness of the top twenty predictions provided by the system. There might be a temptation to develop the fitness function based on the F-score. However, it is not feasible to calculate the F-score because the ground truth of deceptive review detection is generally not available. Figure 5 illustrates the PCGA algorithm that drives the co-evolution processes of the various sub-populations of chromosomes. At the beginning of a co-evolution process (i.e., the first generation), the chromosomes of each sub-population are initialized by randomly assigning feasible values to each evolvable gene. Then, the fitness of each individual (chromosome) of each sub-population is evaluated. During fitness evaluation, the fittest individuals from the other sub-populations are used to derive the system's predictions, and the fitness of each individual of the current sub-population is calculated according to Eq. 5. The fittest chromosome of each sub-population is exchanged periodically according to a pre-defined migration threshold MIG[x]. A high-level view of the co-evolution and chromosome exchange process is depicted in Figure 4. For each sub-population, standard genetic operators such as selection, crossover, and mutation are applied to reproduce the individuals of the next generation (Goldberg 1989; Lau et al. 2006). Moreover, the elitism rate ER[x] of a sub-population is applied to directly transfer a certain percentage of the fittest chromosomes from the current generation to the next in order to strike a balance between exhaustive and diversified search.

Roulette-wheel selection (Maiti and Maiti 2008) is applied to choose relatively fitter chromosomes from the current generation to produce the individuals of the next generation. After two chromosomes are selected for reproduction, the genetic operation of two-point crossover is applied according to a pre-defined crossover probability Pc[x] of a sub-population x. Each chromosome of the selected pair is then considered for the mutation operation after the crossover operation is finished. The evolutionary process (i.e., selection, crossover, and mutation) is repeated until the number of individuals of the next generation reaches the pre-defined size Psize[x]. The aforementioned evolutionary process is applied to each sub-population x. If the average fitness of each sub-population reaches a pre-defined threshold Fit[x], or the number of generations produced exceeds the maximum number of generations Gen, the PCGA algorithm terminates. At this stage, the fittest chromosome from each sub-population is chosen to drive the deceptive review detection process.

When compared to other state-of-the-art parallel genetic algorithms (Roberge et al. 2014), our proposed PCGA differs in that a master-slave control model is not used. Instead, each sub-population is evolved in parallel on a separate computing node of a cluster to maximize computational efficiency. In addition, we adopt an indirect migration process in the sense that the fittest chromosome from another sub-population does not directly migrate into the sub-population under consideration. Instead, the exchanged chromosome is only utilized to compute the fitness of the current individuals. In other words, the fittest chromosomes from outside indirectly influence the evolution of the individuals of the sub-population under consideration. It should be noted that parallel GAs also employ standard genetic operators such as selection, crossover, and mutation (Roberge et al. 2014); it is the parallel and distributed co-evolution and migration processes that really distinguish parallel GAs from traditional GAs.
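The indirect migration scheme described above can be sketched as follows. This is a single-process toy simulation, not the cluster implementation: the bit-counting fitness stands in for the P@20 measure (which requires user feedback in the real framework), and all parameter values are illustrative. Note that broadcast chromosomes are only used to evaluate fitness, never copied into another population.

```python
import random

def evolve_islands(num_islands=3, psize=20, gene_len=8, gens=30,
                   pc=0.8, pm=0.05, mig=5, seed=42):
    """Toy island-model GA with indirect migration: each island's best
    chromosome is broadcast and used only inside the fitness function."""
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(gene_len)]
                for _ in range(psize)] for _ in range(num_islands)]
    best = [max(isl, key=sum) for isl in islands]   # initial broadcast

    def fitness(x, chrom):
        # combine this chromosome with the broadcast bests of the others
        combined = [gene for i, b in enumerate(best)
                    for gene in (chrom if i == x else b)]
        return sum(combined) / len(combined)

    for g in range(1, gens + 1):
        for x, pop in enumerate(islands):
            scored = sorted(pop, key=lambda c: fitness(x, c), reverse=True)
            nxt = scored[:2]                        # elitism
            while len(nxt) < psize:
                a, b = rng.sample(scored[:psize // 2], 2)
                child = list(a)
                if rng.random() < pc:               # uniform crossover
                    child = [ai if rng.random() < 0.5 else bi
                             for ai, bi in zip(a, b)]
                child = [1 - gi if rng.random() < pm else gi
                         for gi in child]           # bit-flip mutation
                nxt.append(child)
            islands[x] = nxt
        if g % mig == 0:                            # periodic broadcast
            best = [max(isl, key=lambda c: fitness(x, c))
                    for x, isl in enumerate(islands)]
    return best
```

Running each island's loop on a separate node with occasional broadcasts is what the cluster deployment adds on top of this sketch.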

Experiments and Results

We performed two experiments to evaluate the effectiveness of the adaptive deceptive review detection method under two different settings (Amazon and Twitter). Moreover, a final experiment was conducted to evaluate the efficiency of the big data analytics framework.

Experiment for the e-Commerce Setting

We crawled 5,698,248 product reviews from 827,608 products belonging to 30 product categories transacted at Amazon during the period from July to September 2012. We then selected 500 popular products, according to the product ranking provided by Amazon, as our evaluation targets. These popular products include gourmet food, grocery, jewelry, office products, PC hardware, books, and electronic products. Following the approach proposed by Ott et al. (2011), we leveraged crowdsourcing via Mechanical Turk to develop deceptive reviews (i.e., spam), and drew random samples from the reviews of the aforementioned popular products to build the legitimate review subset (i.e., ham). For each popular product, one ham and one spam were identified; there were a total of 500 ham and 500 spam in our evaluation data set. For the first detection run, we applied all applicable content-based and behavior-based features for deceptive review detection by the proposed classification ensemble. For a comparative study, we also adopted the SVM (linear kernel) that was used in a previous study (Ott et al. 2011) as the baseline classification module. For the second detection run, we used the previously established ground truth to compute the top-twenty precision P@20, and hence to assess the fitness of the chromosomes of the proposed PCGA. After PCGA-enabled feature selection and system parameter tuning, we applied both the classification ensemble and the SVM baseline to detect deceptive reviews again. The second run allows us to evaluate whether the proposed PCGA could improve detection performance. The performance measures we used included common metrics such as precision, recall (i.e., the true positive rate), and F-score, as well as the ham misclassification rate (hm), the spam misclassification rate (sm), and the combined hm and sm (lam) used in the TREC Spam Track (Cormack 2007). In addition, the Receiver Operating Characteristic (ROC) curve (Hand and Till 2001), which is the graphical representation of recall (i.e., 1 − sm) as a function of hm, is also applied.
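The lam measure combines hm and sm on the logit scale (Cormack 2007), which rewards systems that keep both misclassification rates low. The sketch below reproduces, e.g., the first-run lam of 38.17% reported in Table 1 from hm = 35.80% and sm = 40.60%:

```python
from math import exp, log

def lam(hm: float, sm: float) -> float:
    """Logistic average misclassification rate (TREC Spam Track):
    the inverse-logit of the mean of logit(hm) and logit(sm)."""
    def logit(p):
        return log(p / (1.0 - p))
    mid = (logit(hm) + logit(sm)) / 2.0
    return 1.0 / (1.0 + exp(-mid))
```

When hm equals sm, lam simply equals that common rate; when one rate is near zero, lam is pulled far below the arithmetic mean.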

Table 1. Deceptive Review Detection at the Amazon Setting

                         Recall    Precision   F-score    hm        sm        lam
The First Detection Run
  SVM                    59.40%    62.39%      60.86%     35.80%    40.60%    38.17%
  Ensemble               64.40%    67.22%      65.78%     31.40%    35.60%    33.47%
The Second Detection Run
  SVM                    64.20%    68.88%      66.46%     29.00%    35.80%    32.31%
  Ensemble               70.80%    74.84%      72.76%     23.80%    29.20%    26.41%
Improvement (Ensemble)   +9.94%    +11.33%     +10.62%    +31.93%   +21.92%   +26.71%

Figure 6. The ROC Curves of Two Detection Methods Before/After Adaptation

The results of this experiment are tabulated in Table 1. The genetic parameters applied to the PCGA were set up as follows: population size Psize[x] = 100, elitism rate ER[x] = 10%, crossover probability Pc[x] = 0.82, mutation probability Pm[x] = 0.05, maximum number of generations Gen = 100, maximum average fitness Fit[x] = 0.85, and migration frequency MIG[x] = 20. Although each sub-population might have its own set of genetic parameters, we applied uniform genetic parameters to all sub-populations for simplicity. Table 1 shows that the performance of both the SVM classifier and the classification ensemble is substantially improved after one round of the learning and adaptation cycle. The reason is that the most discriminative features are selected by the PCGA via an effective heuristic search. According to a previous study (Goldberg 1989), genetic algorithms can usually lead to near optimal search outcomes over a large search space. As a result, the classification ensemble and the SVM baseline classifier can use the best features and system parameters to detect deceptive reviews. The performance improvements achieved by the classification ensemble are 10.62% and 26.71% in terms of the F-score and the combined hm and sm score (lam), respectively. This experiment also demonstrates that the proposed classification ensemble method is effective for deceptive review detection after a minimal amount of training: it achieves an F-score of 72.76% and a lam score of 26.41%, respectively. The top ten most discriminative features selected by our PCGA and ranked by the χ² measure are: (1) no. of emotional words, (2) near-duplicate content, (3) sentiment inconsistency, (4) rating inconsistency, (5) emotiveness, (6) no. of URLs, (7) no. of pointers to multimedia content, (8) emoticon smile, (9) length of message, and (10) syntactical entropy. The ROC curves of the two classifiers pertaining to the two detection runs are plotted in Figure 6. It shows that the classification ensemble consistently outperforms the SVM baseline in each of the detection runs (its ROC curve lies on top).
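The χ² feature ranking used above can be computed from a 2×2 contingency table of feature presence against the deceptive/legitimate label; a minimal sketch of the standard statistic (the counts below are hypothetical):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score for a binary feature vs. the class label from a
    2x2 contingency table:
      n11: feature present & deceptive    n10: present & legitimate
      n01: feature absent  & deceptive    n00: absent  & legitimate"""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0
```

A feature distributed identically across both classes scores 0, while a feature that perfectly separates the classes attains the maximum score n.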

Experiment for the Social Media Setting

For the second experiment, we collected information about microblogging users and their tweets via the Twitter firehose in August 2013. A total of 189,312 public user profiles and 3,483,712 raw tweets were obtained. After message filtering for product related comments, the raw data set was substantially reduced to 253,411 tweets routed through 21,622 users. Under a social media setting, it is extremely difficult to determine the ground truth of deceptive comments even with the help of crowdsourcing. Accordingly, we adopted the indirect assessment method proposed by Mukherjee et al. (2013). More specifically, we applied the proposed framework to identify deceptive comments and then invited three human assessors to evaluate the correctness of the top twenty predictions. In other words, only the precision of the framework could be assessed in this experiment. Only if the majority of the human assessors judged a comment detected by the system to be truly deceptive was it considered a true positive case.

For the first detection run, the proposed classification ensemble and the SVM baseline classifier applied all the applicable content-based, behavior-based, and diffusion network-based features to detect deceptive reviews. After the human assessors evaluated the detection results produced from the first run, the user feedback was consumed by the PCGA to learn and select the most discriminative features. The weight of each classifier and the various system parameters were also tuned during the learning and adaptation process. For the second detection run, the system utilized the refined features and system parameters to detect deceptive comments. At the end of the second run, the human assessors evaluated the top twenty system predictions again. The results of the second experiment are tabulated in Table 2. The genetic parameters applied to this experiment were the same as those used in the first experiment. The P@20 precision improvements achieved by the classification ensemble and the SVM baseline are +21.43% and +25.00%, respectively. This experiment also reveals that the proposed classification ensemble method is effective for deceptive review detection after a minimal amount of training. The top ten most discriminative features selected by our PCGA are: (1) writer registration age, (2) AvgDiff(G), (3) no. of friends, (4) no. of followers, (5) sentiment inconsistency, (6) near-duplicate content, (7) no. of pointers to multimedia content, (8) the maximum degree of a node in G, (9) emoticon frown, and (10) no. of comments on the same product produced by the writer. The proposed AvgDiff(G) feature is effective because it captures both a writer's registration time and how a deceptive comment is propagated in a social network.

Table 2. Deceptive Review Detection at the Twitter Setting

               P@20 Precision
               1st Run   2nd Run   Improvement
  SVM          60%       75%       +25.00%
  Ensemble     70%       85%       +21.43%

The average fitness as well as the best fitness of the sub-population representing various system parameters are plotted in Figure 7. This evolution process corresponds to the learning and adaptation cycle involved after the end of the first detection run. It shows that the fitness of this sub-population converges well in the last 20 rounds of evolution. Table 3 depicts the details such as the average fitness, the best fitness, and the standard deviation of fitness of this sub-population. Because of limited space, we only tabulate the details for the last 15 generations. The decreasing standard deviations of the fitness among the individuals of this sub-population in the last 15 generations demonstrate a good convergence process.


Figure 7. The Evolution Process of the Sub-population Representing System Parameters

Table 3. Details of the Evolution Process for System Parameters

Generation   Average Fitness   STD of Fitness   Best Fitness
    81           0.6399            0.3821           0.85
    82           0.6432            0.3801           0.85
    83           0.6784            0.3795           0.85
    84           0.6532            0.3796           0.85
    85           0.6432            0.3711           0.85
    86           0.6399            0.3654           0.85
    87           0.6309            0.3612           0.85
    88           0.6274            0.3519           0.85
    89           0.6294            0.3508           0.85
    90           0.6294            0.3477           0.85
    91           0.7094            0.3486           0.85
    92           0.7191            0.3319           0.85
    93           0.6633            0.3246           0.85
    94           0.6621            0.3209           0.85
    95           0.6722            0.2933           0.85
    96           0.6902            0.2856           0.85
    97           0.7210            0.2554           0.85
    98           0.7255            0.2506           0.85
    99           0.7399            0.2395           0.85
   100           0.7413            0.2183           0.85

Experiment for Testing the Efficiency of the Framework

The main objectives of this experiment are to evaluate: (1) the computational efficiency of performing the demanding near-duplicate deceptive review detection (characterized by O(n²), where n is the number of reviews); and (2) the overall system throughput of deceptive review detection. To make a comparative evaluation, we implemented the same computational modules under two single-host environments. The first baseline platform, named DQuad-CPUs, was a single host with dual quad-core Intel 2.33 GHz CPUs, 32GB RAM, and 4 × 8TB hard disks. The second baseline platform, named DQuad-CPU-GPUs, was a single host with dual quad-core Intel 2.33 GHz CPUs and a GPU (Nvidia GTX-690s). In contrast, the proposed ABIGDAD framework was implemented on a cluster of 30 commodity computers. Each computing node was equipped with a dual-core Intel 2.33 GHz CPU, 16GB RAM, and a 1TB hard disk.

For the first task, we ran near-duplicate deceptive review detection over the 5,698,248 Amazon reviews using a probabilistic language modeling approach (Lau et al. 2011). For the ABIGDAD experimental system, the whole set of reviews was distributed among the 30 nodes of the cluster, and probabilistic near-duplicate detection was then invoked in parallel across the 30 nodes. For the two baseline platforms, all reviews were loaded into a single host for near-duplicate review detection. Our experimental results based on the Amazon data set are depicted in Table 4. The second row of Table 4 depicts the average time consumed by a map task. It is clear that the proposed ABIGDAD framework substantially outperforms the other two baselines, achieving a +160% improvement over the better of the two.

For the second task, we ran a stress test over a two-day period. All the Amazon reviews and Twitter messages that we had collected were fed to the evaluated systems by our simulator, evenly distributed over the two-day test period. For messages originating from a specific source, the classification layer utilized the corresponding optimized features learned from the previous learning and adaptation cycle to carry out prediction. The average arrival rate of messages was 2,066 messages per minute (m/m). Once the number of incoming messages reached a pre-defined threshold, the ABIGDAD system invoked the detection modules. The average detection throughput of the three systems is shown in Table 4. Again, the ABIGDAD system substantially outperforms the other two baselines, achieving an average throughput of 10,316 messages per minute. Such a throughput is likely to handle realistic workloads because external APIs can help filter out most noisy and non-relevant messages beforehand.

Table 4. Efficiency of Different Big Data Analytics Frameworks

                            ABIGDAD      DQuad-CPU-GPUs   DQuad-CPUs   Improvement
  Amazon Review Detection   3.5 min      9.1 min          15.7 min     +160%
  Simulated Stress Test     10,316 m/m   4,213 m/m        2,215 m/m    +144%
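The Improvement column can be read as ABIGDAD's gain over the stronger of the two baselines (DQuad-CPU-GPUs). A quick check of the arithmetic, noting that for detection time the baseline is divided by ABIGDAD (lower is better), while for throughput it is the reverse:

```python
def improvement(new, old):
    """Percentage improvement of the new value over the old baseline."""
    return (new / old - 1) * 100

# Detection time: compare the baseline's 9.1 min against ABIGDAD's 3.5 min.
print(round(improvement(9.1, 3.5)))     # roughly +160%, matching the table
# Throughput: compare ABIGDAD's 10,316 m/m against the baseline's 4,213 m/m.
print(round(improvement(10316, 4213)))  # roughly +145%, within rounding of +144%
```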

Conclusions and Future Work

All our research questions have been answered through this study. The Amazon- and Twitter-based experiments show that the proposed framework can effectively detect deceptive product reviews in different settings. The proposed PCGA can bootstrap detection performance by adaptively selecting good feature subsets and system parameters. Finally, our system stress test demonstrates that the ABIGDAD framework, implemented on a cluster of distributed commodity computers, is more efficient than single-host big data analytics approaches. To the best of our knowledge, this is the first successful design of an adaptive big data analytics framework for the detection of deceptive reviews from a big social media data stream.

The practical implication of our research is that organizations can apply our design artifacts to detect and filter deceptive reviews from big social media data streams. As a result, the effectiveness of business intelligence extraction from online social media is enhanced. The direct business implication is that more accurate and timely business intelligence is made available to enhance organizations’ management practice and marketing strategies. The societal implication of our research is that individuals can leverage our design artifacts to identify deceptive reviews in online social media. As a result, they can make more informed decisions regarding online shopping, net-enabled financial investment, or other daily activities. Accordingly, consumer welfare is better protected in our society.

Future work will examine a more sophisticated parallel graph analysis method to improve the performance of our diffusion network-based feature mining process. In addition, the optimal or near-optimal frequency of invoking the learning and adaptation component of the proposed framework will be explored to further improve detection performance. Finally, more rigorous experiments and user-based field tests will be performed to compare the performance of the ABIGDAD framework against that of other existing big data analytics solutions.

Acknowledgements

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 145712), an SRG grant from City University of Hong Kong (Project No. 7008138), and the Shenzhen Municipal Science and Technology R&D Funding – Basic Research Program (Project No. JCYJ20130401145617281).

References

Archak, N., Ghose, A., and Ipeirotis, P. 2011. “Deriving the Pricing Power of Product Features by Mining Consumer Reviews,” Management Science (57:8), pp. 1485-1509.

Benevenuto, F., Magno, G., Rodrigues, T., and Almeida, V. 2010. “Detecting Spammers on Twitter,” in Proceedings of the Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, Redmond, Washington, pp. 1-9.

Benevenuto, F., Rodrigues, T., Almeida, V., Almeida, J., and Gonçalves, M. 2009. “Detecting Spammers and Content Promoters in Online Video Social Networks,” in Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, pp. 620-627.

Bifet, A., Holmes, G., Pfahringer, B., Read, J., Kranen, P., Kremer, H., Jansen, T., Seidl, T. 2011. “MOA: A Real-Time Analytics Open Source Framework,” in Machine Learning and Knowledge Discovery in Databases, LNCS Vol. 6913, D. Gunopulos, T. Hofmann, D. Malerba, and M. Vazirgiannis (eds.), New York: Springer-Verlag, pp. 617-620.

Boden, C., Karnstedt, M., Fernandez, M., Markl, V. 2013. “Large-scale Social-media Analytics on Stratosphere,” in Proceedings of the 22nd International Conference on World Wide Web Companion, pp. 257-260.

Bollen, J. and Mao, H. 2011. “Twitter Mood as a Stock Market Predictor,” IEEE Computer (44:10), pp.91-94.

Canny, J. and Zhao, H. 2013. “Big data Analytics with Small footprint: Squaring the Cloud,” in Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, pp. 95-103.

Castillo, C., Mendoza, M., and Poblete, B. 2011. “Information Credibility on Twitter,” in Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, pp. 675–684.

Chen, C., and Tseng, Y. 2011. “Quality Evaluation of Product Reviews using an Information Quality Framework,” Decision Support Systems (50:4), pp.755–768.

Cormack, G. V. 2007. “TREC 2007 Spam Track Overview,” in Proceedings of the 2007 TREC Conference. Available at: http://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf.

Dellarocas, C. 2006. “Strategic Manipulation of Internet Opinion Forums: Implications for Consumers and Firms,” Management Science (52:10), pp. 1577–1593.

Diaz-Aviles, E. 2013. “Living Analytics Methods for the Web Observatory,” in Proceedings of the 22nd International Conference on World Wide Web Companion, pp. 1321-1323.

Esuli, A. and Sebastiani, F. 2005. “Determining the Semantic Orientation of Terms through Gloss Classification,” in Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, pp. 617 - 624.

Goldberg D. 1989. Genetic Algorithms in Search, Optimization and Machine Learning, Reading, Massachusetts: Addison-Wesley.

Grier, C., Thomas, K., Paxson, V., and Zhang, C. 2010. “@spam: the Underground on 140 Characters or Less,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, Chicago, Illinois, pp. 27-37.

Hand, D. and Till, R. 2001. “A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems,” Machine Learning (45:2), pp. 171-186.

Hevner, A., March, S., Park, J., Ram, S. 2004. “Design Science in Information Systems Research,” MIS Quarterly (28:1), pp. 75-105.


Hu, N., Bose, I., Koh, N., Liu, L. 2012. “Manipulation of Online Reviews: An analysis of Ratings, Readability, and Sentiments,” Decision Support Systems (52:3), pp. 674–684.

Hullermeier, E. and Vanderlooy, S. 2010. “Combining Predictions in Pairwise Classification: An Optimal Adaptive Voting Strategy and its Relation to Weighted Voting,” Pattern Recognition (43:1), pp.128-142.

Jindal, N. and Liu, B. 2008. “Opinion Spam and Analysis,” in Proceedings of the 2008 International Conference on Web Search and Web Data Mining, pp. 219-229.

Kumar, A., Niu, F., Ré, C. 2013. “Hazy: Making it Easier to Build and Maintain Big-data Analytics,” Communications of the ACM, (56:3), pp. 40-49.

Lau, R.Y.K., Liao, S.Y., Kwok, R.C.W., Xu, K.Q., Xia, Y., and Li, Y. 2011. “Text Mining and Probabilistic Language Modeling for Online Review Spam Detection,” ACM Transactions on Management Information Systems (2:4), article 25.

Lau, R.Y.K., Liao, S.Y., and Xu, K.Q. 2010. “An Empirical Study of Online Consumer Review Spam: A Design Science Approach,” in Proceedings of the Thirty-First International Conference on Information Systems, Completed Research Paper 35.

Lau, R.Y.K., Tang, M., Wong, O., Milliner, S., Chen, Y. 2006. “An Evolutionary Learning Approach for Adaptive Negotiation Agents,” International Journal of Intelligent Systems (21:1), pp. 41-72.

Lin, J. and Ryaboy, D. 2013. “Scaling Big Data Mining Infrastructure: the Twitter Experience,” SIGKDD Explorations Newsletter (14:2), pp. 6-19.

Liu, J., Cao, Y., Lin, C., Huang, Y., Zhou, M. 2007. “Low-Quality Product Review Detection in Opinion Summarization,” In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 334-342.

Maiti, A. K. and Maiti, M. 2008. “Discounted Multi-item Inventory Model via Genetic Algorithm with Roulette Wheel Selection, Arithmetic Crossover and Uniform Mutation in Constraints Bounded Domains,” International Journal of Computer Mathematics (85:9), pp. 1341-1353.

Mukherjee, A., Kumar, A., Liu, B., Wang, J., Hsu, M., Castellanos, M., Ghosh, R. 2013. “Spotting Opinion Spammers using Behavioral Footprints,” in Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, pp. 632-640.

Mukherjee, A., Liu, B., and Glance, N. 2012. “Spotting Fake Reviewer Groups in Consumer Reviews,” in Proceedings of the 21st World Wide Web Conference, Lyon, France, pp. 191–200.

Ott, M., Choi, Y., Cardie, C., and Hancock, J. 2011. “Finding Deceptive Opinion Spam by any Stretch of the Imagination,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, pp. 309-319.

Ott, M., Choi, Y., Cardie, C., and Hancock, J. 2011. “Estimating the Prevalence of Deception in Online Review Communities,” in Proceedings of the 21st International World Wide Web Conference, Lyon, France, pp. 201-209.

Ounis, I., Macdonald, C., and Soboroff, I. 2008. “Overview of the TREC-2008 Blog Track”, in Proceedings of the Seventeenth Text Retrieval Conference, Gaithersburg, Maryland, 2008. Available from http://trec.nist.gov/pubs/trec17/.

Piskorski, J., Sydow, M., Weiss, D. 2008. “Exploring Linguistic Features for Web Spam Detection: a Preliminary Study,” in Proceedings of the 4th international workshop on Adversarial information retrieval on the Web, pp. 25-28.

Ratkiewicz, J., Conover, M., Meiss, M., Gonçalves, B., Patil, S., Flammini, A., and Menczer, F. 2011. “Truthy: Mapping the Spread of Astroturf in Microblog Streams,” in Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, pp. 249–252.

Roberge, V., Tarbouchi, M., and Okou F. 2014. “Strategies to Accelerate Harmonic Minimization in Multilevel Inverters Using a Parallel Genetic Algorithm on Graphical Processing Unit,” IEEE Transactions on Power Electronics (29:10), pp. 5087-5090.

Sarawagi, S. 2006. “Efficient Inference on Sequence Segmentation Models,” in Proceedings of the ACM International Conference on Machine Learning, (148), pp. 793-800.

Sun, H., Morales, A., Yan, X. 2013. “Synthetic Review Spamming and Defense,” in Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, pp. 1088-1096.

Tan, W., Blake, M., Saleh, I., Dustdar, S. 2013. “Social-Network-Sourced Big Data Analytics,” IEEE Internet Computing (17:5), pp.62-69.


Tan, H., Luo, W., Ni, L. 2012. “CloST: a Hadoop-based Storage System for Big Spatio-temporal Data Analytics,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2139-2143.

Valle, M. 2007. “The Digital Information War,” Online Information Review (31:1), pp. 5-9.

Valitutti, A., Strapparava, C., and Stock, O. 2004. “Developing Affective Lexical Resources,” PsychNology Journal (2:1), pp. 61-83.

Wang, G., Xie, S., Liu, B., Yu, P.S. 2012. “Identify Online Store Review Spammers via Social Review Graph,” ACM Transactions on Intelligent Systems and Technology (3:4), article 61.

Wasserman, S. and Faust, K. 1999. Social Network Analysis: Methods and Applications, New York: Cambridge University Press.

Wigan, M. and Clarke, R. 2013. “Big Data's Big Unintended Consequences,” IEEE Computer (46:6), pp. 46-53.

Xie, S., Wang, G., Lin, S., Yu, P.S. 2012. “Review Spam Detection via Temporal Pattern Discovery,” in Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Beijing, China, pp. 823-831.

Xu, K., Liao, S.Y., Lau, R.Y.K., Zhao, L. 2014. "Effective Active Learning Strategies for the Use of Large-Margin Classifiers in Semantic Annotation: An Optimal Parameter Discovery Perspective," INFORMS Journal on Computing, (26:3), pp. 461-483.

Zeng, D. and Lusch, R. 2013. “Big Data Analytics: Perspective Shifting from Transactions to Ecosystems,” IEEE Intelligent Systems, (28:2), pp. 2-5.