[IEEE 2013 Chinese Automation Congress (CAC) - Changsha, Hunan, China (2013.11.7-2013.11.8)] 2013 Chinese Automation Congress - BBS opinion leader mining based on an improved PageRank

978-1-4799-0333-7/13/$31.00 © 2013 IEEE

BBS Opinion Leader Mining Based on an Improved PageRank Algorithm Using MapReduce

Lincheng Jiang, Bin Ge, Weidong Xiao, Mingze Gao

Science and Technology on Information Systems Engineering Laboratory

National University of Defense Technology

Changsha, China

Email: [email protected]

Abstract—Opinion leaders in bulletin board systems (BBS) play an important role in the formation of public opinion, so mining them helps to understand and guide that opinion. This paper designs and implements an opinion leader mining system based on an improved PageRank algorithm using MapReduce. The improved PageRank algorithm uses sentiment analysis to define the weight of the link between users. The system has three main steps. First, a web crawler collects BBS data, which are then preprocessed. Second, an online social network is constructed by mapping the reply relations between posts and comments to relations between posters and comment authors. Finally, the improved PageRank algorithm ranks BBS users in the Hadoop cloud computing environment. Comparative experiments with the original PageRank algorithm show that the system is accurate, efficient, and practical.

Keywords—opinion leader; Hadoop MapReduce; PageRank; BBS

I. INTRODUCTION

With the rapid development of Internet technology and the growing number of Internet users, more and more people express personal opinions and participate in social discussions through tools such as blogs, microblogs, and forums. Many domestic hot events can quickly generate huge public pressure. The Internet has become one of the main carriers of public opinion. A bulletin board system (BBS) is a convenient online community where users exchange opinions and ideas on clear themes and purposes, based on their interests, in a group discussion style. BBS users can initiate or reply to posts that interest them using their accounts. They can improve their authority through public recognition of their posts.

During the formation of public opinion, opinion leaders in BBS play an important role. Opinion leaders are those users who can put forward guiding ideas and have a broad impact on other users [1]. In BBS, opinion leaders have accumulated a high reputation. They are normally more interconnected and have a higher status, education, and social standing. Their words and opinions tend to influence and change other users' opinions, which guides and promotes the development of public opinion. Opinion leaders' influence may be positive or negative. An opinion leader mining system can provide technical support for discovering hot issues in time and guiding the healthy development of public opinion.

In recent years, a considerable body of literature on online opinion leader mining has been published. Shuai Zhu et al. used an X-means iterative clustering filter model based on the Bayesian information gain maximization formula to find the opinion leader feature points in the feature space after analyzing the essential attributes of opinion leaders [2]. Weizhe Zhang et al. constructed a heterogeneous network to represent the relationship between the topics and replies in a web forum. They identified opinion leaders by quantifying the influence of forum users in the network [3]. Yanyan Li et al. proposed an improved mix framework for opinion leader identification in online learning communities by analysing textual content, user behaviour, and time. They ranked opinion leaders based on four distinguishing features: expertise, novelty, influence, and activity [4]. Ning Ma and Yijun Liu proposed a SuperedgeRank algorithm for opinion leader identification based on supernetwork theory, which combined network topology analysis and text mining [5].

In the era of big data, the numbers of users and posts in BBS are growing massively, and traditional algorithms struggle to balance computational efficiency and accuracy. To solve this problem, we design and implement a BBS opinion leader mining system based on an improved PageRank algorithm using MapReduce. PageRank is an effective ranking algorithm, but its running speed decreases significantly as the number of nodes increases. The MapReduce framework makes it possible to run PageRank in parallel, which greatly speeds up the computation. Meanwhile, the paper uses sentiment analysis to define the weight of the links between users, which can improve the accuracy of the PageRank algorithm.

The rest of the paper is organized as follows. Section II briefly presents related work. Section III describes the improved PageRank algorithm and its MapReduce implementation in detail. Section IV introduces the opinion leader mining system. Section V presents the experimental results and analysis. Section VI concludes the paper with a summary.

II. RELATED WORKS

A. PageRank

PageRank is a link analysis algorithm used by the Google web search engine to measure the importance of web pages. It was proposed by Larry Page and Sergey Brin in the late 1990s [6]. PageRank takes advantage of the link structure to indicate the value of web pages. In effect, all the other pages on the Internet vote for a particular page: a link pointing to a page counts as one vote of support, and a page with no incoming links receives no votes.

The PageRank value is defined as:

$$PR(p_i) = \frac{1-d}{N} + d \times \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} \tag{1}$$

where $p_1, p_2, \ldots, p_N$ are the pages under consideration, $PR(p_i)$ is the PageRank value of page $p_i$, $M(p_i)$ is the set of pages that link to $p_i$, $L(p_j)$ is the number of outbound links on page $p_j$, $N$ is the total number of pages, and $d$ is a damping parameter between 0 and 1, which is often set to 0.85 [7]. PageRank is a probability distribution, so the PageRank values of all pages sum to 1.

Fig. 1. A Schematic Diagram of PageRank

Note that the PageRank value of page $p_j$ is divided evenly among its forward links to contribute to the ranks of the pages it points to. Fig. 1 demonstrates the propagation of rank from one pair of pages to another. Fig. 2 shows a consistent steady-state solution for a set of pages.

Fig. 2. Steady State of PageRank
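The iteration implied by Eq. (1) can be sketched in a few lines of Python. This is only a minimal single-machine illustration, not the implementation used in the paper: the function name `pagerank`, the stopping rule, and the three-page graph are assumptions made for the example, and the sketch assumes every page has at least one outbound link.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate Eq. (1) on a dict {page: [pages it links to]}.

    Assumes every page has at least one outbound link (no dangling pages).
    """
    pages = sorted(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # uniform starting distribution
    for _ in range(max_iter):
        new = {p: (1.0 - d) / n for p in pages}
        for pj, targets in out_links.items():
            share = pr[pj] / len(targets)     # PR(p_j) / L(p_j), split evenly
            for pi in targets:
                new[pi] += d * share
        delta = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:                       # assumed L1 stopping rule
            break
    return pr


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph page C receives links from both other pages and ends up with the highest value, and the values still sum to 1, as the definition requires.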

B. MapReduce

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. The model has a Map function and a Reduce function [8], [9], [10], [11]. The MapReduce system designates Map processors, assigns the K1 input key value that each processor works on, and provides that processor with all the input data associated with that key value. Map is run exactly once for each K1 key value, generating output organized by key values K2. The MapReduce system then designates Reduce processors, assigns the K2 key value that each processor works on, and provides that processor with all the Map-generated data associated with that key value. Reduce is run exactly once for each K2 key value produced by the Map step. The MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome. Fig. 3 is a schematic diagram of MapReduce.

Fig. 3. A Schematic Diagram of MapReduce
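The Map, group-by-key, and Reduce stages described above can be imitated in a single process to make the data flow concrete. The sketch below is not Hadoop code; `run_mapreduce` and the word-count functions are hypothetical names chosen for illustration, and the grouping dictionary merely stands in for the distributed shuffle.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one Map -> group-by-key -> Reduce pass in a single process."""
    groups = defaultdict(list)
    for k1, v1 in records:                    # Map: one call per input record
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)             # shuffle: collect values by K2
    # Reduce: one call per distinct K2, output sorted by K2
    return [(k2, reduce_fn(k2, groups[k2])) for k2 in sorted(groups)]


def count_words_map(post_id, text):
    """Emit (word, 1) for every word of a post."""
    for word in text.lower().split():
        yield word, 1


def count_words_reduce(word, counts):
    """Add up the counts collected for one word."""
    return sum(counts)


posts = [(1, "good post"), (2, "very good analysis")]
result = run_mapreduce(posts, count_words_map, count_words_reduce)
# result == [("analysis", 1), ("good", 2), ("post", 1), ("very", 1)]
```

Here the post identifiers play the role of K1 and the words play the role of K2.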

At present, there are two popular MapReduce programming frameworks: Google MapReduce and Hadoop MapReduce [12]. Hadoop MapReduce is modeled on Google MapReduce. Because it is an open-source solution, we use the Hadoop MapReduce programming framework in this paper. The Hadoop MapReduce framework consists of a master service named JobTracker and several slave services named TaskTracker running on multiple nodes. The master is responsible for scheduling the subtasks of each job on the slaves and monitoring them; if it finds failed tasks, it re-runs them. A TaskTracker needs to run on an HDFS datanode, while the JobTracker does not. In general, the JobTracker should be deployed on a separate machine. Fig. 4 shows the scheduling process of Hadoop MapReduce.

Fig. 4. Hadoop MapReduce Scheduling Process


III. IMPROVED PAGERANK AND IMPROVED PAGERANK USING MAPREDUCE

A. Improved PageRank

The original PageRank algorithm assumes that the probability of following any link from page $p_j$ to a page $p_i$ is the reciprocal of page $p_j$'s out-degree. That is, page $p_j$ is equally likely to link out to any of its next-hop nodes. In the social network of a BBS, however, the out-links of a particular user certainly differ in weight. We therefore need to assign each link an appropriate weight that fits the real social relations between users.

In this paper we use sentiment analysis of posts to define the weight of the link between users, an approach similar to that of Xiao et al. [13]. The emotion of a reply can reveal the replier's positive, negative, or neutral attitude toward the author. In a BBS, user $p_j$ may reply to user $p_i$ several times. The average emotion value of user $p_j$'s replies to user $p_i$ is used as the weight of the link from user $p_j$ to user $p_i$. Its formula is:

$$w_{ji} = \frac{\sum s_{ji}}{t_{ji}} \tag{2}$$

where $s_{ji}$ is the emotion value of the replies from user $p_j$ to user $p_i$ in an article chain, $\sum s_{ji}$ is the sum of these emotion values over all article chains, and $t_{ji}$ is the number of times user $p_j$ replies to user $p_i$.
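As a small illustration of Eq. (2), the sketch below averages emotion values per (replier, author) pair. The function name `link_weights`, the tuple layout, and the sample replies are assumptions for the example; it also treats each reply as carrying its own emotion value, so summing over replies stands in for summing over article chains.

```python
from collections import defaultdict


def link_weights(replies):
    """Average emotion value per (replier, author) pair, as in Eq. (2).

    `replies` is an iterable of (replier, author, emotion_value) tuples.
    """
    total = defaultdict(float)   # sum of s_ji over all replies from j to i
    count = defaultdict(int)     # t_ji: number of replies from j to i
    for replier, author, score in replies:
        total[(replier, author)] += score
        count[(replier, author)] += 1
    return {pair: total[pair] / count[pair] for pair in total}


# Hypothetical data: user "b" replied to "a" three times, "c" once.
replies = [("b", "a", 2), ("b", "a", 1), ("b", "a", -1), ("c", "a", -2)]
weights = link_weights(replies)
# weights[("b", "a")] is 2/3 and weights[("c", "a")] is -2.0
```

A pair that never interacts simply has no entry, which corresponds to a zero in the weight matrix.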

Calculating the weight for every pair of users yields the weight matrix:

$$W = \begin{pmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & \ddots & \vdots \\ w_{N1} & \cdots & w_{NN} \end{pmatrix} \tag{3}$$

where N is the number of users in BBS.

In this case, the form of the PageRank algorithm also changes. In the original PageRank algorithm, the weight between users is 0 or 1, but in the improved PageRank algorithm it is given by $W$. The formula of the improved PageRank algorithm is:

$$PR(p_i) = \frac{1-d}{N} + d \times \sum_{p_j \in M(p_i)} \frac{PR(p_j) \times w_{ji}}{C(p_j)} \tag{4}$$

$$C(p_j) = \sum_{p_k \in T(p_j)} w_{jk} \tag{5}$$

where $p_1, p_2, \ldots, p_N$ are the users in the BBS, $PR(p_i)$ is the PageRank value of user $p_i$, $M(p_i)$ is the set of users who link to user $p_i$, $N$ is the total number of users, $w_{ji}$ is the weight of the link from user $p_j$ to user $p_i$, $T(p_j)$ is the set of users to whom user $p_j$ links, and $C(p_j)$ is the sum of the absolute values of $p_j$'s out-link weights. $d$ is the damping coefficient, which is set to 0.85.
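As a worked illustration of Eqs. (4) and (5), consider a hypothetical three-user network with $d = 0.85$, $N = 3$, and every PageRank value initialized to $1/3$. The weights $w_{21} = 2$, $w_{23} = -1$, $w_{31} = 1$, and $w_{12} = 1$ are invented for this example and are not taken from the experiments; $C(p_j)$ is taken as the sum of absolute out-link weights, as the text describes. Since $M(p_1) = \{p_2, p_3\}$, one update of $PR(p_1)$ gives:

$$\begin{aligned}
C(p_2) &= |w_{21}| + |w_{23}| = 2 + 1 = 3, \qquad C(p_3) = |w_{31}| = 1,\\
PR(p_1) &= \frac{1 - 0.85}{3} + 0.85 \times \left( \frac{\tfrac{1}{3} \times 2}{3} + \frac{\tfrac{1}{3} \times 1}{1} \right)\\
&= 0.05 + 0.85 \times \frac{5}{9} \approx 0.522 .
\end{aligned}$$

User $p_2$ passes two thirds of its value to $p_1$ because the positive link to $p_1$ carries twice the absolute weight of the negative link to $p_3$.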

B. Improved PageRank Using MapReduce

Improved PageRank using Hadoop MapReduce is described as follows:

1) Step 1: For every target node, output <key: Target, value: partial PageRank value from the starting node>;

2) Step 2: The Hadoop MapReduce framework groups values with the same key. At the Reduce stage, the partial PageRank values for each key are summed to obtain the new PageRank values. The result is stored in HDFS and serves as the input for the next cycle;

3) Step 3: Determine whether the algorithm has converged. If it has, stop iterating. If not, combine the output of Step 1 and Step 2 as input and continue iterating the improved PageRank algorithm.

The process is implemented on a Hadoop cloud computing cluster. The pseudo-code of the Mapper and Reducer functions is given below:

Algorithm 1 Map (key, value)

Input:
    BBS user p_i
    PR(p_i): the PageRank value of user p_i
    links[p_1, p_2, ..., p_m]: all the users p_j linked by user p_i

Output:
    list of <key : value>

Emit(p_i, links[p_1, p_2, ..., p_m])
For each p_j in links[]
    partial(j) = PR(p_i) × w_ij / C(p_i)
    Emit(p_j, partial(j))
End For

Algorithm 2 Reduce (key, value)

Input:
    BBS user p_j
    list of <p_j : partial(j)>

Output:
    PR(p_j): the PageRank value of user p_j

// Initialize the new PageRank value of user p_j
PR(p_j) = 0
For each partial(j) in the list
    PR(p_j) += partial(j)
End For
PR(p_j) = d × PR(p_j) + (1 − d)/N
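The Mapper, the Reducer, and the convergence check of Step 3 can be put together in a single-process sketch. It is not the Hadoop job used in the paper: the names `map_user`, `reduce_user`, and `improved_pagerank`, the nested-dictionary layout of the weights, and the use of the summed absolute change as the convergence measure are all assumptions, and the example weights are hypothetical.

```python
from collections import defaultdict


def map_user(pi, pr_i, out_weights):
    """Mapper of Algorithm 1: emit p_i's contribution to each user it links to.

    `out_weights` maps each linked user p_j to the weight w_ij.
    """
    c_i = sum(abs(w) for w in out_weights.values())   # C(p_i)
    for pj, w_ij in out_weights.items():
        yield pj, pr_i * w_ij / c_i


def reduce_user(partials, d, n):
    """Reducer of Algorithm 2: sum the partial values, then apply damping."""
    return d * sum(partials) + (1.0 - d) / n


def improved_pagerank(weights, d=0.85, tol=1e-5, max_iter=100):
    """Repeat Map/Reduce passes until the total change falls below `tol`."""
    users = sorted(weights)
    n = len(users)
    pr = {u: 1.0 / n for u in users}
    for _ in range(max_iter):
        grouped = defaultdict(list)               # stands in for the shuffle
        for pi in users:
            for pj, partial in map_user(pi, pr[pi], weights[pi]):
                grouped[pj].append(partial)
        new = {u: reduce_user(grouped[u], d, n) for u in users}
        delta = sum(abs(new[u] - pr[u]) for u in users)
        pr = new
        if delta < tol:                           # Step 3: convergence check
            break
    return pr


# Hypothetical weights: weights[j][i] is w_ji, the weight of the link j -> i.
weights = {
    "u1": {"u2": 1.0},
    "u2": {"u1": 2.0, "u3": 1.0},
    "u3": {"u1": 1.0},
}
scores = improved_pagerank(weights)
```

With these weights `u1` ranks highest, since it receives the heavier of `u2`'s two links as well as `u3`'s only link.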

IV. SYSTEM DESIGN AND IMPLEMENTATION

The BBS opinion leader mining system integrates a web crawler, regular expression matching, sentiment analysis, the PageRank algorithm, cloud computing, and other technologies. Fig. 5 shows the structure of the system. The system is mainly composed of the following four sub-modules: the BBS data crawling module, the data preprocessing module, the sentiment analysis module, and the opinion leader mining module.

A. BBS Data Crawling

Crawling web pages is the basis of public opinion research, and it requires a web crawler. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawling frontier. URLs from the frontier are recursively visited according to a set of policies. The main function of a web crawler is to copy all the pages it visits for later processing so that users can analyze them much more quickly.

Fig. 5. The Structure of the System

In this paper, an improved Netcrawler [14] is used. Users need to set the seed URL of the BBS to be studied, the number of threads, the storage location, the crawling depth, and the crawling type. The collected data are stored locally as HTML files for the subsequent modules.
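The seed-and-frontier crawl with a depth limit can be sketched as a breadth-first traversal. This is not Netcrawler's code or API; the function `crawl`, the `get_links` callback, and the in-memory `site` dictionary are hypothetical and replace real HTTP requests so that the example stays self-contained.

```python
from collections import deque


def crawl(seeds, get_links, max_depth=2):
    """Breadth-first crawl: visit seeds, then URLs found on visited pages.

    `get_links(url)` is a caller-supplied function returning the hyperlinks
    found on a page; a real crawler would download and parse the page here.
    """
    frontier = deque((url, 0) for url in seeds)   # the crawling frontier
    seen = set(seeds)
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth == max_depth:
            continue                              # do not expand further
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


# Hypothetical in-memory site used in place of real HTTP requests.
site = {
    "/board": ["/post/1", "/post/2"],
    "/post/1": ["/board", "/post/2"],
    "/post/2": ["/post/3"],
    "/post/3": [],
}
order = crawl(["/board"], lambda url: site.get(url, []), max_depth=2)
# order == ["/board", "/post/1", "/post/2", "/post/3"]
```

The `seen` set keeps a page from being queued twice, and lowering `max_depth` to 1 stops the crawl before `/post/3` is reached.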

B. Data Preprocessing

Data preprocessing converts unstructured and semi-structured web pages into structured data. More specifically, this step analyzes the content of the collected HTML pages and uses regular expression matching to extract important information from the body of each page, including the post title, the poster, the comment authors, and the comment content.

This process requires flexible use of regular expressions to eliminate the various interfering elements in the pages. After data preprocessing, the extracted information is stored in the database for the subsequent modules.
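The extraction step might look like the following sketch, which pulls the post title, the poster, and the comments out of a page with regular expressions. The HTML markup, class names, and attribute names are entirely hypothetical; the paper does not give the patterns it uses, and a real forum page needs patterns fitted to its own structure.

```python
import re

# Hypothetical markup; a real forum page needs patterns fitted to its own HTML.
page = """
<h1 class="title">Market outlook</h1>
<div class="post" data-author="alice">Prices will rise.</div>
<div class="reply" data-author="bob">The analysis is very good.</div>
<div class="reply" data-author="carol">A little extreme.</div>
"""

TITLE_RE = re.compile(r'<h1 class="title">(.*?)</h1>', re.S)
POST_RE = re.compile(r'<div class="post" data-author="([^"]+)">')
REPLY_RE = re.compile(r'<div class="reply" data-author="([^"]+)">(.*?)</div>', re.S)


def extract(html):
    """Pull title, poster, and (comment author, comment text) pairs."""
    title = TITLE_RE.search(html)
    poster = POST_RE.search(html)
    return {
        "title": title.group(1).strip() if title else None,
        "poster": poster.group(1) if poster else None,
        "comments": [(a, t.strip()) for a, t in REPLY_RE.findall(html)],
    }


record = extract(page)
```

Each (comment author, poster) pair in `record` corresponds to one reply link in the social network built from the data.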

C. Sentiment Analysis

When analyzing the relationships between BBS users, we should examine not only whether links exist between them but also the weight of those links. In this paper, we use sentiment analysis for this purpose. Sentiment analysis is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [15]. In this paper, a method based on an emotion corpus is used for text orientation analysis. The BBS emotion corpus is established by machine learning, which uses previous BBS posts as the training set and testing set. The established corpus gives each emotional word a numerical score. For example, the score of "good" is 1, the score of "very good" is 2, the score of "poor" is -1, and the score of "very poor" is -2.

After establishing the corpus, we can score each comment. For example, consider the comment "The analysis is very good, but a little extreme". The words "very good" and "extreme" express the comment author's emotional orientation: the score of "very good" is 2 and the score of "extreme" is -1. Summing them gives the score of the comment, which is 1.
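The scoring of the example comment can be reproduced with a small lexicon lookup. The sketch is only an assumption about how such scoring could work: the hand-written `LEXICON` contains just the scores quoted in the text, whereas the paper's corpus is learned from BBS posts, and the longest-phrase-first matching rule is a design choice made here so that "very good" is not also counted as "good".

```python
import re

# Scores taken from the examples in the text; a real corpus would be far larger.
LEXICON = {"good": 1, "very good": 2, "poor": -1, "very poor": -2, "extreme": -1}

# Try longer phrases first so "very good" is not also counted as "good".
_PHRASES = sorted(LEXICON, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(p) for p in _PHRASES) + r")\b")


def comment_score(text):
    """Sum the scores of all emotional words and phrases found in `text`."""
    return sum(LEXICON[m] for m in _PATTERN.findall(text.lower()))


score = comment_score("The analysis is very good, but a little extreme")
# score == 1, matching the worked example in the text (2 + (-1))
```

The sketch ignores negation and context, so a phrase such as "not good" would still score as positive.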

D. Opinion Leaders Mining

Improved PageRank using MapReduce is adopted in this step. The algorithm has been described in detail in Section III. Here we mainly describe two issues that should be noted.

1) Dangling Users: In a BBS, users who have no links to other users are called dangling users, and their number may be very large [16]. The rows in the weight matrix $W$ corresponding to dangling users would be zero if left untreated. Several ideas have been proposed to deal with the zero rows and force $W$ to be stochastic [17]. The most popular approach adds artificial links to the dangling users by replacing the zero rows in the matrix with the same vector so that the matrix $W$ is stochastic.

2) Rank Sink: In a BBS, two or more users might link to each other to form a loop. If these users are referred to by users outside the loop but do not refer to any of them, they accumulate rank but never distribute it. This scenario is called a rank sink. Nodes that lead to a rank sink must be deleted before running the algorithm.
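The treatment of dangling users described above can be sketched as a row replacement on the weight matrix. The text only says that zero rows are replaced by "the same vector"; using the uniform vector with entries $1/N$ is an assumption made here, as are the function name `fix_dangling_rows` and the sample matrix.

```python
def fix_dangling_rows(w):
    """Replace every all-zero row of a square weight matrix with 1/N entries.

    Rows that already contain a non-zero weight are returned unchanged.
    """
    n = len(w)
    uniform = [1.0 / n] * n
    return [list(row) if any(row) else list(uniform) for row in w]


# Hypothetical 3-user matrix; the user in the last row has no out-links.
w = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
patched = fix_dangling_rows(w)
# patched[2] is now [1/3, 1/3, 1/3]; the first two rows are unchanged
```

The other rows are left unnormalized in this sketch because the division by $C(p_j)$ in the ranking formula already rescales each user's out-link weights.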

V. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental Data

The experimental data were crawled from a well-known BBS of Tianya, a globally influential online community. The statistics of the crawled BBS network data are shown in Table I.

TABLE I. STATISTICS OF THE BBS NETWORK

Statistic                             Value
Number of users in the data set       374302
Number of posts in the data set       357283
Number of comments in the data set    4702214

B. Experimental Environment

The test environment built for this paper is shown in Table II. Because of the large amount of experimental data, opinion leader mining is conducted on a Hadoop cloud computing platform composed of four servers using the MapReduce method. Each server runs two Map tasks, for a total of eight.

TABLE II. TEST ENVIRONMENT

Computing cluster    Composed of four computers
CPU                  Intel Core i5-3210M 2.50 GHz
Memory               DDR3 4 GB
Hard disk drive      750 GB


C. Experiment and Analysis

The initial PageRank value of each user is set to 1/374302, and the convergence error threshold is set to 0.00001. The top 10 scoring users are selected as opinion leaders. The ranking results are shown in Table III.

TABLE III. RANKING RESULT

Ranking   Tianya Ranking    Improved PageRank   PageRank
1         dtrader99         dtrader99           dtrader99
2         xiaozhu909        xiaozhu909          xiaozhu909
3         stswin            stswin              Tangxuewei2010
4         Tangxuewei2010    Tangxuewei2010      stswin
5         Yiyu Chengchen    Yiyu Chengchen      Lov vinccy
6         Shuishi ShuiFei   Shuishi ShuiFei     Shuishi ShuiFei
7         Lov vinccy        Lov vinccy          Yiyu Chengchen
8         keke6666          keke6666            keke6666
9         Naizhen           Pan Xinghua         Pan Xinghua
10        Deng Haozhi       Deng Haozhi         Xunyeren

Table III shows that the mining result of the improved PageRank algorithm is nearly the same as the Tianya ranking. Considering only which users are identified as opinion leaders, regardless of their rank, the accuracy of the improved PageRank reaches 90%, while the accuracy of the original PageRank is 80%. When the rank of each opinion leader is also taken into account, the accuracy of the improved PageRank is still 90%, while the accuracy of the original PageRank is only 40%. The accuracy of the improved PageRank is thus much higher than that of the original PageRank.
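The four accuracy figures can be recomputed from the three top-10 lists. The paper does not spell out its accuracy formulas, so the two measures below are an interpretation: the share of Tianya's top-10 users that appear anywhere in a method's list, and the share of ranks at which both lists name the same user. The function names are hypothetical; the user lists are copied from the ranking table.

```python
tianya = ["dtrader99", "xiaozhu909", "stswin", "Tangxuewei2010", "Yiyu Chengchen",
          "Shuishi ShuiFei", "Lov vinccy", "keke6666", "Naizhen", "Deng Haozhi"]
improved = ["dtrader99", "xiaozhu909", "stswin", "Tangxuewei2010", "Yiyu Chengchen",
            "Shuishi ShuiFei", "Lov vinccy", "keke6666", "Pan Xinghua", "Deng Haozhi"]
original = ["dtrader99", "xiaozhu909", "Tangxuewei2010", "stswin", "Lov vinccy",
            "Shuishi ShuiFei", "Yiyu Chengchen", "keke6666", "Pan Xinghua", "Xunyeren"]


def membership_accuracy(reference, predicted):
    """Share of reference users that appear anywhere in the predicted list."""
    return len(set(reference) & set(predicted)) / len(reference)


def position_accuracy(reference, predicted):
    """Share of ranks at which both lists name the same user."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


results = {
    "improved": (membership_accuracy(tianya, improved), position_accuracy(tianya, improved)),
    "original": (membership_accuracy(tianya, original), position_accuracy(tianya, original)),
}
# results == {"improved": (0.9, 0.9), "original": (0.8, 0.4)}
```

Under this interpretation the computed values agree with the 90%, 80%, 90%, and 40% figures reported above.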

The system not only obtains highly accurate mining results; its greater advantage is its high efficiency, because it runs on a cloud computing platform. When running the improved PageRank, the average time for the computing cluster to complete one iteration is 240 seconds, whereas the original PageRank takes 1820 seconds, about 7.6 times as long. Moreover, as the number of users increases, the original PageRank algorithm comes under more pressure, while the algorithm designed in this paper can easily cope by adding computing resources.

VI. CONCLUSION

In this paper, the improved PageRank algorithm using MapReduce is adopted to mine BBS opinion leaders. The paper has designed the structure and process of the mining system. Moreover, it integrates a variety of computer technologies to make the system operate well, such as web crawling, regular expression matching, sentiment analysis, PageRank, and cloud computing. Since the core module of the system uses a parallel programming model and runs on the cloud platform, it can quickly obtain accurate mining results from big data sets. The system has high practical value. With the growing importance of BBS opinion leaders, the system will attract more attention.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China No. 60903225, the Doctoral Program of Higher Specialized Research Fund No. 20114307110008, and the National Science and Technology Support Program No. 2012BAH08B01.

REFERENCES

[1] K. Song, D. Wang, S. Feng, D. Wang, Detecting positive opinion leader group from forum, LNCS, vol. 7418, 2012, pp. 95-101.

[2] S. Zhu, X. Zheng, D. Chen, Research of algorithm for automatic opinion leader detection in BBS, System Engineering Theory and Practice, vol. 31, 2011, pp. 7-12.

[3] W. Zhang, B. Wang, H. He, Z. Tan, Public opinion leader community mining based on the heterogeneous network, Acta Electronica Sinica, vol. 40, 2012, pp. 1927-1932.

[4] Y. Li, S. Ma, Y. Zhang, R. Huang, Kinshuk, An improved mix framework for opinion leader identification in online learning communities, Knowledge-Based Systems, vol. 43, 2012, pp. 43-51.

[5] N. Ma, Y. Liu, SuperedgeRank algorithm and its application in identifying opinion leader of online public opinion supernetwork, Expert Systems with Applications, in press.

[6] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford University, 1998.

[7] S. Brin and L. Page, Reprint of: The anatomy of a large-scale hypertextual web search engine, Computer Networks, vol. 56, 2012, pp. 3825-3833.

[8] L. Ralf, Google's MapReduce programming model, Science of Computer Programming, vol. 68, 2007, pp. 208-237.

[9] A. McNabb, C. Monson, K. Seppi, Parallel PSO Using MapReduce, Proc. IEEE Congress on Evolutionary Computation (CEC 07), IEEE Press, Sep. 2007, pp. 7-14.

[10] D. Jeffrey and G. Sanjay, MapReduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, 2008, pp. 107-113.

[11] E. Jaliya, P. Shrideep, F. Geoffrey, MapReduce for data intensive scientific analyses, Proc. IEEE International Conference on eScience (eScience 08), IEEE Press, Dec. 2008, pp. 277-284.

[12] D. Borthakur, The Hadoop Distributed File System: Architecture and Design, The Apache Software Foundation, 2007.

[13] Y. Xiao, W. Xu, L. Xia, Algorithms of BBS Opinion Leader Mining Based on Sentiment Analysis, LNCS, vol. 6318, 2010, pp. 360-369.

[14] Netcrawler, www.net-crawler.org

[15] B. Liu, Sentiment analysis and opinion mining, Synthesis Lectures on Human Language Technologies, 2012, pp. 1-167.

[16] I. Ipsen and T. Selee, PageRank computation, with special attention to dangling nodes, SIAM Journal on Matrix Analysis and Applications, vol. 29, 2007, pp. 1281-1296.

[17] N. Eiron, K. McCurley, J. Tomlin, Ranking the Web Frontier, Proc. International Conference on World Wide Web (WWW 04), ACM, May 2004, pp. 309-318.
