SENTIMENT ANALYSIS OF BIG SOCIAL DATA WITH APACHE HADOOP A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI‘I AT MĀNOA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ELECTRICAL ENGINEERING DECEMBER 2014 By Qiuling Kang Thesis Committee: Xiangrong Zhou, Chairperson Galen Sasaki Rui Zhang Keywords: Twitter, Analysis Sentiment, Hadoop MapReduce, HDFS


SENTIMENT ANALYSIS OF BIG SOCIAL DATA WITH APACHE HADOOP

A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE

UNIVERSITY OF HAWAI‘I AT MĀNOA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

IN

ELECTRICAL ENGINEERING

DECEMBER 2014

By

Qiuling Kang

Thesis Committee:

Xiangrong Zhou, Chairperson

Galen Sasaki

Rui Zhang

Keywords: Twitter, Analysis Sentiment, Hadoop MapReduce, HDFS 


ACKNOWLEDGMENTS

I would like to express my appreciation to my advisor, Professor Zhou, for his patience and help throughout my master's program. Thanks to his generous and valuable suggestions, I was able to complete my master's project on time. I would also like to thank Professor Sasaki for his guidance during my studies in the EE department and for reviewing this manuscript. In addition, I would like to thank Professor Zhang for reviewing my thesis; his insightful suggestions have improved it.


ABSTRACT

Twitter is a microblogging service and a very popular communication mechanism. Twitter users express their interests, favorites, and sentiments towards the various topics and issues they encounter in daily life. Twitter is therefore an important online platform for people to express their opinions, which are a key factor influencing their behavior. Thus, sentiment analysis of Twitter data is meaningful for both individuals and organizations when making decisions. Because of the huge amount of data generated on Twitter every day, building a system that can store and process such big data has become a challenge. In this study, we present a method to collect Twitter data sets and to store and analyze them on the Hadoop platform. The experimental results show that the presented method performs efficiently.


TABLE OF CONTENTS

Acknowledgments
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Contribution of the Thesis
1.4 Thesis Overview
Chapter 2 Twitter Data Collection and Storage
2.1 Twitter Data Collection
2.1.1 Twitter API Introduction
2.1.2 Tweepy Introduction and Installation
2.1.3 Collection Procedure
2.2 Storage in HDFS Filesystem
Chapter 3 Sentiment Analysis of Tweets in Hadoop System
3.1 Algorithm Selection
3.1.1 Decision Trees
3.1.2 Naive Bayes Classifiers
3.1.3 Support Vector Machines
3.2 Sentiment Analysis of Tweets
3.2.1 Extract Feature from Tweets
3.2.3 Classifier
3.3 Run on Hadoop
3.3.1 Hadoop MapReduce and HDFS
3.3.2 MapReduce Functions
Chapter 4 Experiments and Results
4.1 Scottish Independence Vote Analysis
4.2 Some Hawai‘i Tourism Sites Analysis
4.3 Performance Environment
Chapter 5 Conclusion and Open Issues
References


LIST OF TABLES

Table 1.1 Traditional RDBMS compared to Hadoop
Table 3.1 Words with sentiment polarity in tweet
Table 4.1 The environment of the experiment


LIST OF FIGURES

Figure 1.1 The increasing trends of tweets in recent years
Figure 2.1 The process of requesting from REST APIs
Figure 2.2 Streaming APIs working procedure
Figure 2.3 Twitter dataset collection procedure
Figure 2.4 Create a Twitter application
Figure 2.5 Create a Twitter application
Figure 2.6 Obtain token
Figure 2.7 The procedure of data collection
Figure 2.8 Sample of collected files
Figure 2.9 Architecture of HDFS
Figure 2.10 HDFS workflow
Figure 3.1 The process of sentiment analysis
Figure 3.2 An example of a decision tree
Figure 3.3 Naive Bayes classifier represented by graph
Figure 3.4 Example of SVM maximum margin and margin
Figure 3.5 The workflow of opinion mining
Figure 3.6 The Hadoop MapReduce and HDFS architecture
Figure 3.7 A client submits a job to MapReduce
Figure 3.8 The Mapper flowchart
Figure 3.9 The reducer flowchart
Figure 4.1 Data analysis process
Figure 4.2 The curve for tweets based on different keywords
Figure 4.3 Scottish independence vote polarity values
Figure 4.4 The attitude distribution based on different keywords
Figure 4.5 Values of positive vs negative
Figure 4.6 Distribution for attitude polarity based on keyword “Hawaii”
Figure 4.7 Distribution for attitude polarity based on keyword “Waikiki”
Figure 4.8 Distribution for attitude polarity based on keyword “Diamond head”
Figure 4.9 Distribution for attitude polarity based on keyword “Hanauma bay”


CHAPTER 1 INTRODUCTION

1.1 Background

Microblogging websites have become one of the major sources of information. Twitter is one such popular microblogging service: an online social networking platform that allows people to publish messages expressing their interests, favorites, opinions, and sentiments towards the various topics and issues they encounter in their daily lives. The messages, called tweets, are published in real time and are at most 140 characters each [1]. About 200 billion tweets are published per year, which corresponds to about 500 million tweets per day, 350,000 tweets per minute, and 6,000 tweets per second [2]. Figure 1.1 shows the increasing trend of tweets in recent years. Such a huge amount of data can be used for social network studies and analysis to obtain useful and meaningful results.

Figure 1.1 The increasing trends of tweets in recent years [2]
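
The quoted rates are rounded figures that can be cross-checked against one another. The short Python sketch below takes the daily figure of 500 million tweets from [2] as its only input and derives the other rates; the function name tweet_rates is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365

def tweet_rates(tweets_per_day):
    """Derive per-second, per-minute and per-year rates from a daily count."""
    return {
        "per_second": tweets_per_day / SECONDS_PER_DAY,
        "per_minute": tweets_per_day / MINUTES_PER_DAY,
        "per_year": tweets_per_day * DAYS_PER_YEAR,
    }

rates = tweet_rates(500_000_000)
for name, value in rates.items():
    print(f"{name}: {value:,.0f}")
```

The derived values (roughly 5,800 per second, 347,000 per minute, and 182.5 billion per year) agree with the rounded figures quoted above.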

There is previous research on sentiment analysis of Twitter data. Pak and Paroubek (2010) performed a linguistic analysis of collected tweets and showed how to build a sentiment classifier from training data [4]. Sitaram Asur and Bernardo A. Huberman (2010) showed that social media content can be used to predict real-world performance; they built a linear regression model for forecasting the box-office revenues of movies [5]. Apoorv Agarwal and Boyi Xie (2011) introduced POS-specific prior polarity features and designed a new tree representation for a tree-kernel-based model [6]. Hsiang Hui Lek and Danny C.C. Poo (2013) proposed an aspect-based sentiment classification approach that improves on existing tweet-level classifiers [7].

Classification techniques are fundamental to analyzing the sentiment of social data. S. B. Kotsiantis (2007) reviewed recent classification techniques, discussing and comparing several supervised learning algorithms [3]: logic-based algorithms, such as decision trees and rule-based algorithms; perceptron-based techniques, such as single-layered and multilayered perceptrons; Radial Basis Function (RBF) networks; statistical learning algorithms, such as Naive Bayes classifiers and Bayesian networks; instance-based learning; and Support Vector Machines.

1.2 Motivation

Since the number of users of social networking platforms and services is growing fast, more and more data from these platforms can be used for data mining studies. For example, a government may be interested in people's attitudes toward a vote and may wish to predict its result by answering questions such as the following [4]:

1. Could the new policy get the support of most people?

2. How positive (or negative) are people about the new policy?

3. What kind of people have the most influence on the result?

Also, a local tourism company may be interested in which places are popular among tourists. It may want answers to the following questions [4]:

1. Which tourist attractions are most visited?

2. How positive (or negative) are people about a tourist attraction?

3. At what time of day do people prefer to go out?

In this thesis, we show how to use datasets from a microblogging platform for data mining. We collected large datasets from Twitter and analyzed them. There are several reasons for using Twitter data for opinion mining [4]:

1. Valuable data source: Twitter is a social network platform used by a wide variety of people to post their opinions on various topics and discuss current issues.

2. Sufficient data: the volume of tweets grows at a high rate, so sufficient data can be gathered for data mining.

3. Variety of users: Twitter users come from many groups of people, for example researchers, politicians, students, farmers, workers, and artists.

We collected more than one million tweets published on Twitter. They are separated into two sets:

1. Two thirds of the tweets concern the Scottish independence vote held in September 2014.

2. One third of the tweets concern tourism in Hawai‘i, U.S.

Because we process a huge amount of data, the problems of how to store large datasets and how to improve computing performance become significant and cannot be ignored. There are several reasons why we use the Hadoop system instead of a Relational Database Management System (RDBMS) in this study:

Access speed [8][9]

Although the storage capacity of a single disk has increased considerably with the development of hard drives over the years, the speed at which data can be read from the disk has not kept up, so reading all the data from a single hard drive usually takes a long time. The MapReduce model built on Hadoop, however, is effective when unstructured data from different nodes are combined for merging and sorting.

Data duplication [8][9]

Data must be duplicated on distinct storage systems to avoid the problems caused by hardware failure. However, we cannot simply use an RDBMS with many disks for large-scale computation: if seek time dominates the data access time, reading and writing large portions of the dataset record by record takes longer than streaming through it. That is to say, when large portions of a database are updated, an RDBMS works less efficiently than MapReduce built on Hadoop. The Hadoop Distributed File System (HDFS) stores data in a distributed manner, replicating data sets across a cluster of hard drives so that the failure of one disk does not cause data loss.
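
The seek-time argument can be made concrete with a back-of-the-envelope calculation. The sketch below compares updating a fraction of a dataset record by record (one disk seek per record) with reading the whole dataset as a sequential stream. The disk characteristics (10 ms per seek, 100 MB/s transfer rate), the 1 TB dataset size, and the 100-byte record size are assumed values chosen for illustration; they are not measurements from this study.

```python
def seek_bound_seconds(num_records, seek_time_s=0.010):
    """Time to touch records individually, assuming one seek per record."""
    return num_records * seek_time_s

def streaming_seconds(dataset_bytes, transfer_rate_bps=100e6):
    """Time to read the whole dataset sequentially at the transfer rate."""
    return dataset_bytes / transfer_rate_bps

DATASET_BYTES = 1e12   # assumed 1 TB dataset
RECORD_BYTES = 100     # assumed record size
total_records = DATASET_BYTES / RECORD_BYTES

stream_time = streaming_seconds(DATASET_BYTES)
for fraction in (0.00001, 0.01, 1.0):
    seek_time = seek_bound_seconds(total_records * fraction)
    faster = "seeking" if seek_time < stream_time else "streaming"
    print(f"fraction {fraction:g}: seek {seek_time:,.0f} s, "
          f"stream {stream_time:,.0f} s -> {faster} is faster")
```

Under these assumptions, touching only a tiny fraction of the records is cheaper with seeks, but once a large portion of the dataset is updated, a sequential pass is far faster, which is the access pattern MapReduce relies on.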

Linear scalability [8][9]

When gigabytes of structured data are processed, an RDBMS requires high data integrity. Hadoop, in contrast, can store very large datasets and process petabytes of structured, semi-structured, or unstructured data with linear scaling and low integrity requirements. If we increase the number of nodes in the cluster, the speed of processing data increases proportionally. This is generally not the case for SQL queries. Table 1.1 shows the comparison between a traditional RDBMS and Hadoop.

We show how to collect datasets from Twitter via the Twitter API and perform sentiment analysis of the collected datasets on the Hadoop distributed system.


Table 1.1 Traditional RDBMS compared to Hadoop [9]

1.3 Contribution of the Thesis

This thesis presents a method to collect a large amount of data concerning specific topics from Twitter via the Twitter API. The study extracts features from tweets and uses a sentiment classifier to classify the tweets into positive and negative attitude classes in order to analyze people's opinions on specific topics and issues. The study stores and analyzes the datasets using HDFS and Hadoop MapReduce, respectively, which are more scalable and efficient than a traditional RDBMS for this workload. The experimental results show that the presented method performs efficiently.

1.4 Thesis Overview

The rest of the thesis is organized as follows. Chapter 2 presents Twitter data collection and storage. Chapter 3 introduces sentiment analysis of tweets on the Hadoop system. Chapter 4 describes the experiments and results. Finally, Chapter 5 concludes the thesis and discusses open issues.


CHAPTER 2 TWITTER DATA COLLECTION AND STORAGE

Millions of tweets are generated by Twitter users every day [4]. Through the Twitter API (Application Programming Interface), researchers and developers can collect large public data sets from Twitter. Twitter provides two types of APIs for accessing its data: REST APIs and Streaming APIs [22]. With the REST APIs, users must request information explicitly to retrieve tweets; these APIs give access to some of the core primitives of Twitter, including timelines, status updates, and user information. With the Streaming APIs, users can collect a continuous stream of public information; these APIs deliver large quantities of real-time data of a specific type, filtered by tracked keyword, geographic area, or user, or drawn as a random sample. As long as a long-lived connection is maintained, users receive a continuous stream of updates. In this research, we retrieve Twitter data via the Streaming APIs because our research objective, analyzing user sentiment about given topics, requires collecting the messages published by users. Calls to Twitter's APIs must use the authentication method supported by Twitter. Twitter uses OAuth (Open Authorization), an open standard for authorizing access to protected information. After obtaining the data set, we store the data in HDFS.

2.1 Twitter Data Collection

2.1.1 Twitter API Introduction

Twitter has two types of APIs for users to access Twitter data: REST APIs and Streaming APIs. The REST APIs do not require users to keep a persistent HTTP (Hypertext Transfer Protocol) connection open. A user makes one or more requests to a web application, and the application then returns the results of the user's initial request. Figure 2.1 shows the process of requesting from REST APIs [14].

Figure 2.1 The process of requesting from REST APIs [14]

Twitter offers three types of endpoints for Streaming APIs: public streams carry the public data flowing through Twitter as public tweets; user streams are single-user streams that correspond to the view of a single user; site streams are multi-user streams intended for servers that connect on behalf of many Twitter users. In this research, we collect real-time tweets via the Twitter Streaming APIs. Figure 2.2 shows how the Streaming APIs work. The incoming tweets are handled by a streaming process that parses, filters, and/or aggregates them before the results are stored in a data store. To respond to user requests, the HTTP server queries the results from the data store [14].
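
The parse, filter, and aggregate stages described above can be sketched in a few lines of Python. This is a minimal offline illustration, not Twitter's or Tweepy's implementation: the incoming stream is simulated by a list of JSON strings, the data store is a plain in-memory dictionary, and the function names are hypothetical.

```python
import json
from collections import Counter

def parse(raw_lines):
    """Parse each raw line as JSON, skipping blank or malformed lines."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no tweet
        try:
            yield json.loads(line)
        except ValueError:
            continue  # incomplete or corrupted record

def keep_matching(tweets, keywords):
    """Keep tweets whose text contains at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    for tweet in tweets:
        text = tweet.get("text", "").lower()
        if any(k in text for k in lowered):
            yield tweet

def aggregate(tweets, keywords):
    """Count how many tweets mention each keyword."""
    counts = Counter()
    for tweet in tweets:
        text = tweet["text"].lower()
        for k in keywords:
            if k.lower() in text:
                counts[k] += 1
    return counts

# Simulated stream: two tweets, one blank line, one broken line.
raw_stream = [
    '{"text": "Sunset at Waikiki tonight"}',
    '',
    '{"text": "Hiking Diamond Head, then Waikiki"}',
    '{"text": "truncated',
]
keywords = ["Waikiki", "Diamond Head"]
matched = keep_matching(parse(raw_stream), keywords)
data_store = {"keyword_counts": aggregate(matched, keywords)}
print(data_store["keyword_counts"])
```

Writing each stage as a generator means tweets are processed one at a time as they arrive, so the whole stream never has to be held in memory.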

Figure 2.2 Streaming APIs working procedure [14]

2.1.2 Tweepy Introduction and Installation

Tweepy is an open-source library that provides access to the documented Twitter API from Python. It supports accessing Twitter through OAuth, which is the only method adopted by Twitter to secure its information. OAuth offers several benefits: (1) it makes the user's information more secure; (2) it conceals the user's password; (3) if the user changes the password, the application still works, since the application does not rely on a password; (4) permissions are easy to manage [20][21].

In order to get data from the Streaming APIs, our application must first obtain an OAuth access token, and the Tweepy package, a Python library for accessing the Twitter API, must be installed. The Tweepy API class provides access to the Twitter API methods, which accept various parameters and return response data. A copy of the Tweepy package is therefore downloaded and installed on the Ubuntu Linux system. After that, the collection Python script is run to collect data and store the dataset on the local server. The whole collection procedure is shown in Figure 2.3.

Figure 2.3 Twitter dataset collection procedure

In this thesis, the Tweepy package, which requires Python 2.5 or later, was installed on an Ubuntu Linux system. The installation takes three steps.

1. Download the tweepy package to the local server.

$ git clone https://github.com/tweepy/tweepy.git

2. Change into the tweepy directory.

$ cd tweepy

3. Install tweepy with administrator (root) privileges.

$ python setup.py install

To start the collection process, a client application must be registered with Twitter. We log in to the portal and go to “My Applications”. After filling in the information shown in Figure 2.4, a new application can be created. We use the application's credentials to communicate with the Twitter API and retrieve data sets. As shown in the figure, we enter the application name “For Hadoop” in the “Name” field, “This application is used for test” in the “Description” field, and a placeholder in the “Website” field, since we do not have a URL. Finally, we check “Yes, I agree” and click “Create your Twitter application” to complete the creation process.


Figure 2.4 Create a Twitter application

After creating the application, we generate the access token. As shown in Figure 2.5, we click “Create my access token” at the end of the form.


Figure 2.5 Create a Twitter application

Next, we obtain the access token and access token secret shown in Figure 2.6. The figure also includes the other information we need to communicate with the Twitter API: owner, owner ID, API key, and API secret.


Figure 2.6 Obtain token

2.1.3 Collection Procedure

The collection procedure is presented in Figure 2.7. It includes four steps as follows:

[Figure 2.7 flowchart: Start → Set OAuth authentication with username and password → Set request parameters → Set filter method → Access the Twitter API and collect data → End]

Figure 2.7 The procedure of data collection

1. Set OAuth authentication with tokens using Tweepy: Twitter uses OAuth to provide authorized access to its API and requires all requests to use OAuth for authentication [11]. The following code shows how to use Tweepy with OAuth to access the Twitter API.

    import tweepy

    # Application and user credentials from Figure 2.6
    consumer_key = "tCYTPMwiWLXyNBdCS9Ipg"
    consumer_secret = "7BFXcq07s5y4YrjwjP6p3t4cYu0ojeTFG9vq98rE8"
    access_token = "37188238-Nu3991UKfyVIjacGHNnxKmBykHj5W5zX0g89kN4k"
    access_token_secret = "KuINYWTDE1fd5QVmRlVsMBmLTDdgMoq2MnyFmo4pG7gv1"

    # Build the OAuth handler and attach the access token
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    # Create the API interface using the authentication
    api = tweepy.API(auth)

We use the consumer_key, consumer_secret, access_token, and access_token_secret shown in Figure 2.6 to set up OAuth access. The calls to tweepy.OAuthHandler() and auth.set_access_token() carry out the OAuth process, and the final statement, tweepy.API(auth), creates the actual API interface using this authentication.
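
Hard-coding credentials in a script makes them easy to leak when the script is shared. One possible alternative, sketched below, reads the four OAuth values from environment variables and fails early if any is missing. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_credentials are illustrative assumptions, not part of Tweepy or of the original collection script.

```python
import os

# Hypothetical environment variable names, in the order Tweepy expects the values.
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_credentials(env=None):
    """Return the four OAuth values, raising KeyError if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED)

# Example with a stand-in environment; the values are dummies.
demo_env = {name: "dummy-" + name.lower() for name in REQUIRED}
consumer_key, consumer_secret, access_token, access_token_secret = load_credentials(demo_env)
print(consumer_key)
```

The returned values can then be passed to tweepy.OAuthHandler() and auth.set_access_token() in the same way as the literal strings in the listing above.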

2. Set request parameters: in this thesis, a list of keywords and a list of longitude, latitude pairs are used to specify the tweets that will be returned from the Twitter stream. We set the parameters in the following format:

    track = ['key words 1', 'key words 2', 'key words 3']
    follow = []
    # One bounding box per line: west longitude, south latitude, east longitude, north latitude
    geo_location = [-158.25, 21.31, -157.628, 21.71,
                    -156.674, 20.59, -155.99, 21.017,
                    -159.77, 21.89, -159.3, 22.24,
                    -157.310, 21.057, -156.69, 21.21]

At least one of the three parameters track, follow, and locations must be specified.

The parameter “track” is a comma-separated list of phrases that determines which tweets are delivered on the stream. A phrase can contain one or several words separated by spaces, and each phrase must be between 1 and 60 bytes long, inclusive [10][12].

The parameter “follow” is a comma-separated list of user IDs to track. We can collect the tweets of specific users by setting the “follow” parameter [10][12].

The parameter “locations” is a list of longitude, latitude pairs that define bounding boxes of geographic areas to track. Tweets from within the specified areas are retrieved [10][12]. For example, by setting longitude and latitude pairs that bound the Hawaiian Islands, we can track tweets from the Hawai‘i area.
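
The constraints on the request parameters can be checked locally before a connection is opened. The sketch below is an illustrative helper, not part of Tweepy: it checks that each track phrase is between 1 and 60 bytes when UTF-8 encoded, groups the flat coordinate list into bounding boxes of four values (west longitude, south latitude, east longitude, north latitude), and tests whether a point falls inside any box. The function names and the sample point are hypothetical.

```python
def invalid_track_phrases(phrases, max_bytes=60):
    """Return the phrases whose UTF-8 length is outside 1..max_bytes."""
    return [p for p in phrases if not 1 <= len(p.encode("utf-8")) <= max_bytes]

def to_boxes(flat):
    """Group a flat list into (west, south, east, north) bounding boxes."""
    if len(flat) % 4 != 0:
        raise ValueError("locations must contain groups of four values")
    boxes = [tuple(flat[i:i + 4]) for i in range(0, len(flat), 4)]
    for west, south, east, north in boxes:
        if not (west < east and south < north):
            raise ValueError("each box must list its southwest corner first")
    return boxes

def in_any_box(lon, lat, boxes):
    """True if the point lies inside at least one bounding box."""
    return any(w <= lon <= e and s <= lat <= n for w, s, e, n in boxes)

geo_location = [-158.25, 21.31, -157.628, 21.71,
                -156.674, 20.59, -155.99, 21.017,
                -159.77, 21.89, -159.3, 22.24,
                -157.310, 21.057, -156.69, 21.21]

boxes = to_boxes(geo_location)
print(len(boxes))                                   # number of bounding boxes
print(invalid_track_phrases(["Waikiki", "x" * 61]))  # phrases that break the limit
print(in_any_box(-157.9, 21.5, boxes))              # hypothetical point
```

Checking the ordering of each box up front catches the common mistake of swapping latitude and longitude, which would otherwise surface only as an error from the server or as an empty stream.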

3. Use the filter method to collect the tweets that match the request parameters. We pass the parameters to filter() as follows, where stream denotes the Tweepy stream object:

    stream.filter(track=track, follow=follow, locations=geo_location)

4. The tweets that match one or more filter parameters are returned and stored on the local server. These tweets are encoded in JSON (JavaScript Object Notation), a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate [13]. In this thesis, the tweepy package is used to access the Streaming API and gather the tweets encoded in JSON. Figure 2.8 presents part of the Twitter data set we collected.

Figure 2.8 Sample of collected files
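
Because each tweet is stored as one JSON object, the fields needed for later analysis can be pulled out with the standard json module. The sketch below is illustrative: the helper name extract_fields and the choice of fields are assumptions, and the sample record is a shortened stand-in that uses the same field names as the collected data.

```python
import json

def extract_fields(raw_tweet):
    """Parse one JSON-encoded tweet and return the fields used for analysis."""
    tweet = json.loads(raw_tweet)
    coords = tweet.get("coordinates") or {}
    return {
        "id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "screen_name": tweet["user"]["screen_name"],
        "text": tweet["text"],
        # GeoJSON order is [longitude, latitude]; None when the tweet is not geotagged.
        "lon_lat": coords.get("coordinates"),
    }

sample = json.dumps({
    "created_at": "Mon Aug 11 18:50:45 +0000 2014",
    "id_str": "498904369136680961",
    "text": "Example tweet text",
    "user": {"screen_name": "example_user"},
    "coordinates": {"type": "Point", "coordinates": [-157.847390, 21.295619]},
})

record = extract_fields(sample)
print(record["created_at"], record["lon_lat"])
```

Keeping only a handful of fields per tweet reduces the volume of data that later processing stages have to read.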

More than one million tweets were collected from the Twitter API in this thesis. One of the tweets is shown below:

{"created_at":"Mon Aug 11 18:50:45 +0000 2014", "id":498904369136680961, "id_str":"498904369136680961", "text":"@CrankyDad I need to visit you the next time I'm over there meeting with Jay and Sara.", "source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e", "truncated":false, "in_reply_to_status_id":498903812942209024, "in_reply_to_status_id_str":"498903812942209024", "in_reply_to_user_id":19426562, "in_reply_to_user_id_str":"19426562", "in_reply_to_screen_name":"CrankyDad", "user":{"id":15705560, "id_str":"15705560", "name":"JulieFord", "screen_name":"JulieFord808", "location":"Honolulu, Hawaii", "url":"http:\/\/www.SchweitzerConsulting.com", "description":"Local girl after 20 years in Hawaii. Mom of a crazy toddler. Owner of Schweitzer Consulting, a PR consultancy.", "protected":false, "verified":false, "followers_count":1159, "friends_count":953, "listed_count":49, "favourites_count":19,


"statuses_count":4296, "created_at":"Sat Aug 02 22:50:58 +0000 2008", "utc_offset":-36000, "time_zone":"Hawaii", "geo_enabled":true, "lang":"en", "contributors_enabled":false, "is_translator":false, "profile_background_color":"EDECE9", "profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/34245337\/twilk_background.jpg", "profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/34245337\/twilk_background.jpg", "profile_background_tile":true, "profile_link_color":"088253", "profile_sidebar_border_color":"D3D2CF", "profile_sidebar_fill_color":"E3E2DE", "profile_text_color":"634047", "profile_use_background_image":true, "profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000211101091\/04050c00d0da6c74b318be1e34f8a38d_normal.jpeg", "profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000211101091\/04050c00d0da6c74b318be1e34f8a38d_normal.jpeg", "profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/15705560\/1368656596", "default_profile":false, "default_profile_image":false, "following":null, "follow_request_sent":null, "notifications":null}, "geo":{ "type":"Point", "coordinates":[21.295619,-157.847390]}, "coordinates":{"type":"Point","coordinates":[-157.847390,21.295619]}, "place":{"id":"c47c0bc571bf5427", "url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/c47c0bc571bf5427.json", "place_type":"city", "name":"Honolulu", "full_name":"Honolulu, HI", "country_code":"US", "country":"United States", "bounding_box":{"type":"Polygon", "coordinates":[[[-157.950476,21.254837],[-157.950476,21.38505],[-157.648702,21.38505],[-157.648702,21.254837]]]}, "attributes":{}}, "contributors":null, "retweet_count":0, "favorite_count":0, "entities":{"hashtags":[], "trends":[], "urls":[], "user_mentions":[{ "screen_name":"CrankyDad", "name":"Mike Gordon", "id":19426562, "id_str":"19426562",


"indices":[0,10]}], "symbols":[]}, "favorited":false, "retweeted":false, "possibly_sensitive":false, "filter_level":"medium", "lang":"en"}

2.2 Storage in HDFS Filesystem

HDFS is a distributed file system with a master/slave architecture, built into the Hadoop platform. It is designed for storing very large files, which may be hundreds of megabytes, gigabytes, or even petabytes in size, and for running on clusters of inexpensive commodity hardware that need not be highly reliable [9].

Figure 2.9 shows the HDFS architecture. An HDFS cluster consists of a single NameNode (the master) and a number of DataNodes (the slaves). The NameNode hosts the file system index, which takes the form of a namespace image and an edit log. It also knows the DataNodes on which the blocks of each file are located; this information is reconstructed from the DataNodes when the system starts. DataNodes store the data of the filesystem and retrieve blocks when the NameNode tells them to. They report their status to the NameNode periodically. There is also a secondary NameNode, which produces snapshots of the primary NameNode’s memory structures to limit the damage caused by file system corruption [9][14][15].


Figure 2.9 Architecture of HDFS

The HDFS cluster is set up at the beginning of the process, and the collected data sets are then transferred from the local system to HDFS for later sentiment analysis. Figure 2.10 shows how the data sets are stored in HDFS.

Figure 2.10 HDFS workflow

1. HDFS Setup: we install the Hadoop platform on a cluster of servers and configure the files of the NameNode and DataNodes. The following commands are executed to set up the system.

Java 1.6.0_30 was installed on the cluster.


hdp@hadoop:/usr/lib/jvm/default-java$ sudo chmod u+x jdk-6u30-linux-x64.bin
hdp@hadoop:/usr/lib/jvm/default-java$ sudo ./jdk-6u30-linux-x64.bin
hdp@hadoop:/usr/lib/jvm/default-java$ sudo chmod u+x jre-6u30-linux-x64.bin
hdp@hadoop:/usr/lib/jvm/default-java$ sudo ./jre-6u30-linux-x64.bin

SSH was installed on the cluster.

hdp@hadoop:~$ sudo apt-get install ssh
hdp@hadoop:~$ sudo apt-get install rsync
hdp@hadoop:~$ sudo /etc/init.d/ssh start
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service ssh start
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start ssh
hdp@hadoop:~$ ps -ef | grep sshd
root 3700 1 0 15:07 ? 00:00:00 /usr/sbin/sshd -D
hadoop 4071 2685 0 15:18 pts/1 00:00:00 grep --color=auto sshd
hdp@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
15:49:81:02:71:55:a9:a0:9d:a8:e6:4d:c1:00:ae:65 hadoop@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|.. oo...+=+ |
|. . .o . o. |
| .Eo + + .. |
|.o = o .. |
|. . . S |
| o . |
| o o |
| . . |
| |
+-----------------+
hdp@hadoop:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Hadoop was installed on the cluster and its configuration files were edited. The following line was added to hadoop-env.sh:

export JAVA_HOME=/usr/local/java/jdk1.6.0_45

The following environment variables were also set:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin


We configure core-site.xml, hdfs-site.xml, and mapred-site.xml in turn. The network is set up by connecting all servers through a single hub. We assign the IP address 192.168.123.104 to the master machine and 192.168.123.118 and 192.168.123.113 to the two slave machines.

2. Initialize the system: the HDFS filesystem is formatted via the NameNode. The following command is executed:

hdp@master:/usr/local/hadoop$ bin/hadoop namenode -format

Start the system:

hdp@master:/usr/local/hadoop$ bin/start-all.sh

3. Transfer local data sets into HDFS:

hdp@master:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/inputTxt1 /user/hdp/inputTxt1

4. MapReduce calculation

hdp@master:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
> -file /home/hdp/MyMapper.py -mapper /home/hdp/MyMapper.py \
> -file /home/hdp/MyReducer.py -reducer /home/hdp/MyReducer.py \
> -input /user/hdp/inputTxt1/* -output /user/hdp/inputTxt1-output


CHAPTER 3 SENTIMENT ANALYSIS OF TWEETS IN HADOOP SYSTEM

Sentiment analysis is the field of research that identifies and extracts subjective information from written language. It is also called opinion mining, and it aims to analyze people’s attitudes, emotions, and opinions and to classify the polarity of a given text. Sentiment analysis usually classifies the given text into two classes, positive and negative [16]. The proposed process for sentiment analysis of Tweets is described in Figure 3.1.

Figure 3.1 The process of sentiment analysis

The first step is collecting the dataset from the Twitter database. If no expert is available to tell us which fields are the most informative, we could use the brute-force method of gathering everything, in the hope that the relevant features can be isolated later. However, a dataset collected by brute force may contain too much noise and still miss useful information, so we need to define the keywords of the classifier. The second step is therefore the definition of the classifier keywords and data preparation. Keyword selection reduces the data size and removes many irrelevant and redundant features, and thus reduces noise. Tweets filtered by keywords are processed more effectively and faster by the data mining algorithm. In sum, a good selection of classifier keywords contributes to better analysis results.

3.1 Algorithm Selection

Choosing a suitable algorithm is important for sentiment analysis. For the text classification problem, three methods can be applied: Decision Trees, Naive Bayes classification, and Support Vector Machines (SVM) [3]. In the following sections, we introduce and compare them and then state the method we use.

3.1.1 Decision Trees

Decision trees are tree-like graphs that classify instances by sorting them on their attribute values. Each decision node tests one attribute of the instance to be classified, and each branch leaving the node corresponds to a value that the attribute can take. Each leaf node represents a class. Classification starts at the root node, proceeds along the branches that match the instance’s attribute values, and ends at a leaf node. Figure 3.2 shows an example of a decision tree [3].


Figure 3.2 An example of a decision tree.

The decision tree method is simple to understand and easy to implement. General pseudo-code for building a decision tree for sentiment analysis is shown below [3]:

Check for base cases
Create a node r for the tree
For each Tweet in Tweets do:
    If Tweet does not contain keywords, discard the Tweet.
    If Tweet contains keywords, do:
        add a new tree branch below r, corresponding to the test
        if keywords are positive then:
            label the Tweet “Positive attitude”
        Else
            add a new tree branch below, corresponding to the test
            if keywords are negative then:
                label the Tweet “Negative attitude”
            Else
                label the Tweet “Neutral attitude”
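The sequence of tests in the pseudo-code can be sketched in Python as a chain of keyword checks. This is an illustration only: the keyword sets and the function name classify_with_tree are hypothetical, and reading "contains keywords" as a topic test followed by sentiment tests is an assumption.

```python
# Hypothetical keyword lists; the thesis does not give its own.
TOPIC_KEYWORDS = {"scotland", "vote"}
POSITIVE_WORDS = {"good", "great", "support"}
NEGATIVE_WORDS = {"bad", "awful", "against"}


def classify_with_tree(tweet_text):
    """Walk the keyword tests in order, as in the pseudo-code above."""
    words = set(tweet_text.lower().split())
    # Root test: discard tweets that do not mention the topic at all.
    if not words & TOPIC_KEYWORDS:
        return None
    # First branch: positive keywords present?
    if words & POSITIVE_WORDS:
        return "Positive attitude"
    # Second branch: negative keywords present?
    if words & NEGATIVE_WORDS:
        return "Negative attitude"
    # Leaf reached without a sentiment keyword.
    return "Neutral attitude"


print(classify_with_tree("I support the Scotland vote"))
```

Because the positive test comes first, a tweet that contains both positive and negative keywords is labeled positive in this sketch, which shows how the order of the tests shapes the result.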


3.1.2 Naive Bayes Classifiers

Assume that there are two classes of keywords, w1 = positive and w2 = negative, and that the set of sentiment words in a Tweet is represented as T. Let p(wj | T) denote the probability of class wj, given that we have observed T.

Bayesian classifiers use Bayes’ theorem, which is described as follows [3]:

$$p(w_j \mid T) = \frac{p(T \mid w_j)\, p(w_j)}{p(T)}$$

where p(wj | T) is the probability of instance T being in class wj,

p(T | wj) is the probability of generating instance T given class wj,

p(wj) is the probability of occurrence of class wj, and

p(T) is the probability of instance T occurring.

To classify T’s attitude as positive or negative, the probabilities p(w1 | T) and p(w2 | T) are compared, and the class with the larger probability is taken as the more likely sentiment.

Suppose a Tweet contains n sentiment words, T = {t1, t2, …, tn}, where ti = 1 when the i-th word is positive and ti = 2 when it is negative. We assume that all ti are statistically independent and that T contains k positive words, and that the following relations hold:

p(w1) = p(w2) = 0.5

p(ti = 1|w1) >> p(ti = 2|w1)

p(ti = 2|w2) >> p(ti = 1|w2),


p(ti = 1|w1) = p(ti = 2|w2) = p >> 0.5

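Since

$$p(T \mid w_1) = \prod_{i=1}^{n} p(t_i \mid w_1) = p(t_i = 1 \mid w_1)^{k}\, p(t_i = 2 \mid w_1)^{n-k} = p^{k} (1 - p)^{n-k}$$

Similarly,

$$p(T \mid w_2) = p^{n-k} (1 - p)^{k}$$

Thus,

$$\frac{p(w_1 \mid T)}{p(w_2 \mid T)} = \frac{p(T \mid w_1)\, p(w_1)}{p(T \mid w_2)\, p(w_2)} = \frac{p(T \mid w_1)}{p(T \mid w_2)} = \left(\frac{p}{1 - p}\right)^{2k - n}$$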
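Under the assumptions stated above (equal priors and p(ti = 1 | w1) = p(ti = 2 | w2) = p), the posterior ratio simplifies to (p / (1 − p))^(2k − n). The short sketch below checks this closed form against the direct product of word likelihoods. The values of k, n, and p are hypothetical, as are the function names.

```python
import math


def likelihood(n_match, n_other, p):
    """p(T | class): words matching the class have probability p, others 1 - p."""
    return p ** n_match * (1 - p) ** n_other


def posterior_ratio(k, n, p):
    """Closed form of p(w1 | T) / p(w2 | T) with equal priors."""
    return (p / (1 - p)) ** (2 * k - n)


# Hypothetical values: n = 4 sentiment words, k = 3 of them positive, p = 0.8.
k, n, p = 3, 4, 0.8
direct = likelihood(k, n - k, p) / likelihood(n - k, k, p)
closed = posterior_ratio(k, n, p)
print(math.isclose(direct, closed))
label = "positive" if closed > 1 else "negative"
```

With p above 0.5 the ratio exceeds 1 exactly when k is larger than n − k, so the comparison reduces to counting positive and negative words.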

In sum, the classification result depends on the numbers of positive and negative words. For example, consider the input Tweet “Lovely turtle, beautiful fish, and bad weather, but still a fancy trip.” The sentiment polarity of the words in this tweet is shown in the following table.

Table 3.1 Words with sentiment polarity in tweet


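Since

$$p(w_j \mid T) = \frac{p(T \mid w_j)\, p(w_j)}{p(T)}$$

then

$$p(\text{positive} \mid \text{Tweet}) = 0.273$$

$$p(\text{negative} \mid \text{Tweet}) = 0.091$$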

Because p(positive | Tweet) is larger than p(negative | Tweet), that is, p(w1 | T) is larger than p(w2 | T), the sample Tweet is labeled as positive. Because the Naive Bayes classifier assumes that the attributes are independent given the class, the likelihood is estimated as:

$$p(T \mid w_j) = p(t_1 \mid w_j)\, p(t_2 \mid w_j) \cdots p(t_n \mid w_j)$$

Naive Bayes classifiers can be represented as directed acyclic graphs that have one unobserved node as the parent and several observed nodes as children, with a strong independence assumption among the children [3].


[Graph nodes: parent p(t | wj); children p(t1 | wj), p(t2 | wj), …, p(tn | wj)]

Figure 3.3 Naive Bayes classifier represented by graph

General pseudo-code for the Naive Bayes classifier for sentiment analysis is shown below [3]:

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

3.1.3 Support Vector Machines

Support vector machines (SVMs) are supervised machine learning models that can be used for data analysis and classification. An SVM constructs a hyperplane that separates the classes. To achieve the best classification performance, we look for the maximum margin, that is, the hyperplane whose distance to the nearest data point on either side is as large as possible; maximizing the margin reduces an upper bound on the expected generalization error [3].

Suppose we are given linearly separable data points that can be divided into two classes by a hyperplane. Many hyperplanes may separate the two classes. A reasonable choice is the hyperplane that gives the largest separation, or maximum margin, between them. Figure 3.4 shows the maximum-margin hyperplane and the margins for an SVM.

Figure 3.4 Example of SVM maximum margin and margin

If the data sets are linearly separable, we select two parallel hyperplanes that separate the classes such that the distance between them is maximized. The region bounded by them is the margin, which contains no data points. Therefore, there is a pair (w, b) that satisfies the following inequalities [3]:

$$w \cdot x_i + b \ge 1, \quad \text{if } y_i = 1$$

$$w \cdot x_i + b \le -1, \quad \text{if } y_i = -1$$

where xi is an n-dimensional vector,

yi is the class label of xi (1 or −1),

w is the normal vector to the hyperplane, and

b is the offset.

The two constraints can be rewritten as:

$$y_i (w \cdot x_i + b) \ge 1$$

When the two classes are linearly separable, the best hyperplane can be found by solving the quadratic programming optimization problem:

$$\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^{2}$$

subject to

$$y_i (w \cdot x_i + b) \ge 1, \quad \text{for } i = 1, \ldots, n$$

The data points that lie on the margin and satisfy $y_i (w \cdot x_i + b) = 1$ are the support vectors, and the solution is a linear combination of them (see Figure 3.4).

General pseudo-code for SVMs is given in the following process [3].

1) Introduce positive Lagrange multipliers α_i, one for each of the inequality constraints. This gives the Lagrangian:

$$L_P = \frac{1}{2} \lVert w \rVert^{2} - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$

2) Minimize L_P with respect to w, b.

3) Compute the quadratic programming solution w, b.

4) In the solution, those points for which α_i > 0 are called “support vectors”.
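The margin constraints can be checked numerically for a candidate hyperplane without running a quadratic programming solver. The sketch below does only that check: the four data points and the pair (w, b) are hypothetical values chosen by hand for illustration, and the function name functional_margin is an assumption.

```python
import math

# Hypothetical 2-D training points with labels y in {+1, -1}.
points = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
# Candidate hyperplane w . x + b = 0 (chosen by hand for this toy data).
w, b = (0.5, 0.5), -1.0


def functional_margin(x, y, w, b):
    """Return y * (w . x + b); the constraint requires this to be >= 1."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


margins = [functional_margin(x, y, w, b) for x, y in points]
# Small tolerance so that rounding does not reject points lying on the margin.
feasible = all(m >= 1 - 1e-9 for m in margins)
# Support vectors lie exactly on the margin: y * (w . x + b) == 1.
support_vectors = [x for (x, _), m in zip(points, margins) if math.isclose(m, 1.0)]
# Geometric width of the margin between the two bounding hyperplanes.
margin_width = 2 / math.sqrt(sum(wi * wi for wi in w))
print(feasible, support_vectors, margin_width)
```

The width 2 / ‖w‖ is why minimizing ‖w‖² under the constraints maximizes the margin.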


3.2 Sentiment Analysis of Tweets

3.2.1 Extract Feature from Tweets

A Python program extracts features from the collected tweets for later sentiment classification. The process of parsing a Tweet and obtaining unigrams is as follows:

Decode: the datasets collected from the API come in JSON and are decoded into Python data structures for further processing (e.g. JSON: [{"text": "tweet", "truncated": false, "test": [6,14]}], Python: [{u'text': u'tweet', u'truncated': False, u'test': [6, 14]}]).

Filtering: we extract the text element (the tweet content) from the decoded tweet (e.g. “Everyone in Hawai‘i is so nice.”) and then convert the text to lower case (e.g. “everyone in hawai‘i is so nice.”).

Tokenization: we parse the text by splitting it on spaces. We encode the text in UTF-8 to avoid Unicode errors and replace the punctuation in the text.
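The three steps above can be sketched as a single function. This is a minimal illustration written for Python 3 (where strings are already Unicode, so no explicit UTF-8 encoding step is shown), not the thesis program; the function name extract_unigrams is hypothetical and only ASCII punctuation is replaced.

```python
import json
import string


def extract_unigrams(raw_line):
    """Decode one JSON tweet, keep its text, lower-case it and tokenize it."""
    tweet = json.loads(raw_line)            # Decode: JSON -> Python dict
    text = tweet.get("text", "").lower()    # Filtering: keep the text, lower-cased
    # Tokenization: replace ASCII punctuation with spaces, then split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).split()


sample = '{"text": "Everyone in Hawaii is so nice.", "truncated": false}'
print(extract_unigrams(sample))
```

Replacing punctuation with spaces instead of deleting it keeps words apart when a tweet omits the space after a comma.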

3.2.3 Classifier

Since Naive Bayes is fast, space efficient, and not sensitive to irrelevant features, we use the Naive Bayes classifier, which is based on Bayes’ theorem (Anthony J, 2007), in this study.

$$P(w \mid T) = \frac{P(w) \cdot P(T \mid w)}{P(T)}$$

where w is a sentiment word, T is a Twitter message [3].


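The Naive Bayes classifier is based on a strong independence assumption. Therefore, the probabilistic model for the classifier can be described as:

$$R = \frac{P(\text{positive} \mid T)}{P(\text{negative} \mid T)} = \frac{P(\text{positive})\, P(T \mid \text{positive})}{P(\text{negative})\, P(T \mid \text{negative})} = \frac{P(\text{positive}) \prod_{i} P(t_i \mid \text{positive})}{P(\text{negative}) \prod_{i} P(t_i \mid \text{negative})}$$

Comparing the probabilities P(positive|T) and P(negative|T), the larger probability indicates the class label that is more likely to be the actual label. If R is larger than 1, the positive attitude is predicted; otherwise, the negative attitude is predicted.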

During the sentiment analysis, the Naive Bayes classifier assigns a Tweet to the positive class or the negative class by examining the words in the Tweet. Each word is labeled “positive” or “negative” according to the lexicon, and the number of sentiment words of each polarity is counted. If a Tweet contains more positive words than negative words, it is labeled as positive; if it contains fewer positive words than negative words, it is labeled as negative. Words with a neutral label are ignored in this study, since they carry no valuable information for sentiment analysis.
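The counting rule described above can be sketched as follows. The mini-lexicon is hypothetical (the lexicon used in this thesis is not reproduced here), and returning "neutral" for a tie or for a tweet without sentiment words is an assumption, since the text only specifies the positive and negative cases.

```python
# Hypothetical mini-lexicon; the thesis's actual lexicon is not reproduced here.
LEXICON = {
    "lovely": "positive", "beautiful": "positive", "fancy": "positive",
    "nice": "positive", "bad": "negative", "awful": "negative",
}


def classify_by_counts(tokens, lexicon=LEXICON):
    """Label a tweet by comparing its positive and negative word counts."""
    labels = [lexicon[t] for t in tokens if t in lexicon]  # neutral words ignored
    positives = labels.count("positive")
    negatives = labels.count("negative")
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # assumption: tie, or no sentiment word at all


tokens = "lovely turtle beautiful fish and bad weather but still a fancy trip".split()
print(classify_by_counts(tokens))
```

The token list is the example Tweet from Section 3.1.2 after lower-casing and punctuation removal; it contains three positive words and one negative word.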

The algorithm judges the polarity of the text by checking the words in the Tweet and finally outputs the individual’s view. Figure 3.5 shows the workflow of sentiment analysis [17].


Figure 3.5 The workflow of opinion mining

3.3 Run on Hadoop

3.3.1 Hadoop MapReduce and HDFS

Hadoop has a master-slave architecture that consists of HDFS and MapReduce [14]. The large Twitter datasets are stored in HDFS, from which the data is read for processing, and the computational layer’s work is done by MapReduce [18].

The MapReduce master is responsible for deciding where computational work should be scheduled on the slave nodes. The HDFS master is responsible for partitioning the storage across the slave nodes and keeping track of where data is located [18]. Figure 3.6 shows the Hadoop MapReduce and HDFS architecture.


Figure 3.6 The Hadoop MapReduce and HDFS architecture


MapReduce breaks the sentiment analysis processing into a map phase and a reduce phase, which are executed by MyMapper.py and MyReducer.py respectively. The map phase outputs key-value pairs. After being sorted by the built-in Unix sort program, the key-value pairs are processed by the reduce phase, which writes out the results to HDFS.

The following three steps and Figure 3.7 describe the MapReduce process:

1. Map process: the datasets are split into distinct keys and values.

2. Shuffle and sort process: the key-value pairs are shuffled and sorted by key into a logical order.

3. Reduce process: the output of the previous step flows into the reduce process, where it is grouped by key and a function is applied to each group.
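The three steps can be imitated in memory to make the data flow concrete. This sketch is not Hadoop code: run_mapreduce is a hypothetical helper that runs in a single process, and the input labels are made up.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Simulate the map, shuffle-and-sort, and reduce phases in memory."""
    # 1. Map: every input record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # 2. Shuffle and sort: order the pairs by key so equal keys are adjacent.
    pairs.sort(key=itemgetter(0))
    # 3. Reduce: apply the reducer to each key and its grouped values.
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    }


# Hypothetical input: tweets already labelled by a classifier.
labelled = ["positive", "negative", "positive", "neutral"]
counts = run_mapreduce(labelled, lambda label: [(label, 1)], lambda key, values: sum(values))
print(counts)
```

On a cluster the same three phases run in parallel across the slave nodes, with the sorted pairs partitioned by key between reducers.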

Figure 3.7 A client submits a job to MapReduce [18]

3.3.2 MapReduce Functions

Hadoop Streaming, a utility provided with the Hadoop distribution, allows us to create and run Map/Reduce jobs with Python scripts [23]. It passes data between our map and reduce functions. Since it works through standard input and standard output, we write our map and reduce functions in Python, read input data from Python’s sys.stdin, and print the output data to Python’s sys.stdout [9].

The script MyMapper.py reads data from STDIN, splits it into words, and passes them line by line to STDOUT. The key-value pairs output by the map script are not sorted. The intermediate sort is done by the sort program built into UNIX-based systems. After being sorted by key, the key-value pairs are read line by line by the MyReducer.py script through standard input (STDIN), and the script writes its final result to standard output (STDOUT) [9].
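The line-oriented protocol described above can be sketched with two generator functions that take and produce text lines. The contents of MyMapper.py and MyReducer.py are not shown in this thesis, so the code below is an assumption about their general shape: the lexicon, the tab-separated "key, count" format, and the function names are all hypothetical.

```python
# Hypothetical lexicon; MyMapper.py and MyReducer.py themselves are not shown.
LEXICON = {"nice": "positive", "lovely": "positive", "bad": "negative"}


def map_lines(lines):
    """Mapper: emit one 'label<TAB>1' line per sentiment word found."""
    for line in lines:
        for word in line.strip().lower().split():
            if word in LEXICON:
                yield "%s\t1" % LEXICON[word]


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current and current is not None:
            yield "%s\t%d" % (current, total)
            total = 0
        current = key
        total += int(value)
    if current is not None:
        yield "%s\t%d" % (current, total)


# Local stand-in for the job: map, sort by key, then reduce.
tweets = ["Everyone is so nice", "bad weather but lovely beach"]
result = list(reduce_lines(sorted(map_lines(tweets))))
print(result)
```

In a streaming job each function would be fed sys.stdin and its output written to sys.stdout; the reducer relies on its input arriving sorted by key, which is why it only compares each key with the previous one.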


1. Mapper: Figure 3.8 shows the mapper flowchart.

Figure 3.8 The Mapper flowchart


2. Reducer: Figure 3.9 shows the reducer flowchart.

Figure 3.9 The reducer flowchart


CHAPTER 4 EXPERIMENTS AND RESULTS

In this chapter, we present experiments and results for two classification tasks: sentiment classification (positive vs. negative) for the Scottish independence vote, and sentiment classification (positive vs. negative) for Hawai‘i tourism spots. For each task, we follow the procedure described in Figure 4.1 below and apply the Naive Bayes classifier to classify the datasets into positive and negative classes.

Figure 4.1 Data analysis process

4.1 Scottish Independence Vote Analysis

The Scottish independence vote was a referendum on Scottish independence which took

place in Scotland on 18 September 2014. The voters answered “Should Scotland be an

independent country?” with “Yes” or “No” to decide whether Scotland should be

independent [19].

We extracted tweets from Twitter for opinion mining in order to predict the result of the vote. To make sure that all the data sets collected from Twitter refer to the Scottish independence vote, we used keywords concerning the event as search arguments. We extracted tweets via the Twitter Streaming API at frequent intervals, obtaining the timestamp, author, and tweet text for opinion evaluation. About one million tweets were gathered over a period of ten days around the date of the vote. Since polling took place on 18 September, we extracted tweets from 11 September to 20 September for sentiment analysis.

Figure 4.2 The curve for tweets over collecting period based on different keywords

Figure 4.2 shows the time series of the number of tweets about the Scottish vote over the collection period. The busiest day was September 18, 2014, which is reasonable since the vote took place on that day. After September 18, fewer and fewer people discussed the event because polling had ended; thus fewer tweets on the topic could be collected and the curves fall.
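A daily series like the one in Figure 4.2 can be obtained by counting tweets per calendar day from their created_at timestamps. The sketch below is an illustration under assumptions: the timestamps are made up (in the format of the sample tweet in Chapter 2), and tweets_per_day is a hypothetical helper, not the thesis code.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(created_at_values):
    """Count tweets per calendar day from Twitter 'created_at' strings."""
    days = (
        datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y").date()
        for value in created_at_values
    )
    return Counter(days)


# Hypothetical timestamps in the same format as the sample tweet in Chapter 2.
stamps = [
    "Thu Sep 18 07:00:00 +0000 2014",
    "Thu Sep 18 21:30:10 +0000 2014",
    "Fri Sep 19 01:15:42 +0000 2014",
]
daily = tweets_per_day(stamps)
busiest_day, busiest_count = daily.most_common(1)[0]
print(busiest_day, busiest_count)
```

Counting once per keyword would give one curve per keyword, as in the figure.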

[Figure 4.2 plot: x-axis Date, 09/11/14 to 09/20/14; y-axis Tweets Amount, 0 to 14 ×10^4; legend: Scotland, Scottish, Vote, Independence, Independent]


Figure 4.3 Scottish independence vote polarity values

Figure 4.3 displays the number of tweets expressing each attitude over time. As the figure shows, most Twitter users who posted about the independence vote held a neutral attitude rather than a positive or negative one.

[Figure 4.3 plot: x-axis Date, 09/11/14 to 09/20/14; y-axis Tweets Amount, 0 to 4.5 ×10^5; legend: Total, Positive, Negative, Neutral]


Figure 4.4 The attitude distribution based on different keywords

Pie charts A, B, C, D, and E of Figure 4.4 show the attitude distribution for each of the different keywords.
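The share of each attitude in a pie chart is the count of tweets with that label divided by the total for the keyword. The following is a small sketch with made-up labels; attitude_shares is a hypothetical helper and the numbers are not taken from the thesis data.

```python
from collections import Counter


def attitude_shares(labels):
    """Return the percentage of tweets per attitude, rounded to whole percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * count / total) for label, count in counts.items()}


# Hypothetical labels for 20 tweets matching one keyword.
labels = ["positive"] * 3 + ["negative"] * 6 + ["neutral"] * 11
shares = attitude_shares(labels)
print(shares)
```

The function expects a non-empty list of labels; a keyword with no tweets would need to be handled separately.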

[Figure 4.4 pie charts: panels A to E, each showing the shares of Positive, Negative, and Neutral tweets. Percentage labels in extraction order: 16%, 29%, 55%; 13%, 27%, 60%; 21%, 33%, 46%; 6%, 21%, 73%; 3%, 84%, 13%]


Figure 4.5 Distribution of authors’ political standpoint toward Scottish Independence

vote: values of positive vs negative

Figure 4.5 shows the authors’ political standpoint toward the Scottish independence vote. The x axis represents time over the ten days of collection, 240 hours in total. The y axis represents the degree of the authors’ attitude. The blue points represent positive results, which stand for supporting the independence of Scotland, while the red points represent negative results, which stand for opposing it. The peak appears on September 18.

4.2 Some Hawai‘i Tourism Sites Analysis

The Hawaiian Islands, which include Hawai‘i, O‘ahu, Maui, Kaua‘i, and Lāna‘i, are located in the Pacific Ocean and have a significant tourism industry [24]. According to the Hawai‘i government’s 2013 annual report, there were over 8 million visitors to the Hawaiian Islands in 2013, with expenditures of over $15 billion [25]. The most popular times for tourists are the summer months and major holidays; therefore, our tweet collection period runs from August 23 to September 20, 2014. In this study, we mainly collect data concerning these topics: Hawai‘i, the name of the islands; Waikīkī, well known for Waikīkī beach, which is the most popular beach of O‘ahu; Diamond head, the name of a volcanic cone and a major tourist attraction on O‘ahu; Hanauma bay, famous for snorkeling; and Hawai‘i airlines, the largest airline in Hawai‘i [26][27][28][29].

Figure 4.6 shows the distribution of attitude polarity from August 23 to September 20 for the keyword “Hawai‘i”. The blue circles in the figure represent the degree of positive attitude, the red stars represent the negative attitude, and the pink points represent the average of the positive and negative values. As we can observe, all the average points lie above the zero line over the collection period; thus, Twitter authors comment positively on Hawai‘i.

Figure 4.6 Distribution for attitude polarity over collecting period based on keyword

“Hawai‘i”

[Figure 4.6 plot: x-axis 5 to 25; y-axis −0.05 to 0.05; legend: Positive, Negative, Average]


Figure 4.7 shows the distribution of attitude polarity from August 23 to September 20 for the keyword “Waikīkī”. The pink points, which are the averages of the positive and negative values, lie above the zero line. Therefore, we can conclude that authors have a positive attitude toward Waikīkī.

Figure 4.7 Distribution for attitude polarity over collecting period based on keyword

“Waikīkī”

Figure 4.8 describes the distribution of attitude polarity from August 23 to September 20 for the keyword “Diamond head”. People’s average attitude toward Diamond head is positive, since the average values represented by the pink points lie above the zero line.

[Figure 4.7 plot: x-axis 5 to 25; y-axis −0.1 to 0.15; legend: Positive, Negative, Average]


Figure 4.8 Distribution for attitude polarity over collecting period based on keyword

“Diamond head”

Figure 4.9 shows the distribution of attitude polarity from August 23 to September 20 for the keyword “Hanauma bay”. The average values lie above the zero line; thus, authors comment positively on Hanauma bay.

[Figure 4.8 plot: x-axis 5 to 25; y-axis −0.2 to 0.2; legend: Positive, Negative, Average]


Figure 4.9 Distribution for attitude polarity over collecting period based on keyword

“Hanauma bay”

4.3 Performance Environment

In this experiment, we use one server as the master and two servers as slaves, on which we installed the Ubuntu operating system and Hadoop. The environment of the experiment is described in Table 4.1 below.

[Figure 4.9 plot area: series Positive, Negative, and Average; horizontal-axis ticks 5 to 25; vertical-axis ticks -0.2 to 0.2]

Table 4.1 The environment of the experiment

CHAPTER 5 CONCLUSION AND OPEN ISSUES

This study presents a method for collecting datasets on specific topics from Twitter via the Twitter API. We extract features from tweets and use a Naive Bayes classifier to separate the data into two classes, positive and negative, in order to evaluate opinion toward particular topics and issues. We store the original dataset in HDFS and analyze it using the Hadoop MapReduce model. We visualize the analysis results using Matlab. The experimental results indicate that the presented method performs efficiently.
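
As an illustration of the MapReduce step summarized above, the following Python sketch counts tweets per sentiment class in the style of a Hadoop Streaming job. It assumes one JSON-encoded tweet per input line with a `text` field. The `mapper`, `reducer`, and `toy_classify` names are hypothetical, and `toy_classify` is only a keyword stand-in for a trained classifier, not the Naive Bayes classifier used in this thesis.

```python
import json
from itertools import groupby


def toy_classify(text):
    """Hypothetical stand-in for a trained classifier (illustration only)."""
    return "positive" if "love" in text.lower() else "negative"


def mapper(lines, classify=toy_classify):
    """Emit one 'label<TAB>1' record per tweet that carries a text field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict) or "text" not in tweet:
            continue  # skip records without tweet text
        yield f"{classify(tweet['text'])}\t1"


def reducer(sorted_lines):
    """Sum the counts per label. The input must already be sorted by key,
    which is what Hadoop does between the map and reduce phases."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for label, group in groupby(records, key=lambda kv: kv[0]):
        yield f"{label}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local dry run on made-up input; sorted() imitates Hadoop's shuffle.
    sample = [
        '{"text": "Love the sunset at Waikiki"}',
        '{"text": "Traffic near the beach again"}',
    ]
    for record in reducer(sorted(mapper(sample))):
        print(record)
```

In an actual streaming job, the mapper and reducer would each be a separate script that reads from `sys.stdin` and prints its records; keeping them as generator functions here makes the logic easy to test locally before submitting a job.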

Although this thesis evaluates the views of Twitter authors in two cases, predicting the result of the Scottish independence vote and analyzing tourists' attitudes toward some popular tourist attractions in Hawai‘i, many open issues still require further investigation and research. This paragraph discusses some of the open issues related to this thesis that deserve attention. This thesis uses a Naive Bayes classifier for classification; in future work we may modify it to improve its performance, or try other classifiers that do not rely on the independence assumption.
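
To make the independence assumption concrete, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing in Python. It is an illustration under assumptions, not the classifier configuration used in this thesis: the tokenized toy tweets, the `train` and `predict` names, and the smoothing constant are hypothetical. The assumption enters in `predict`, where the per-word log-likelihoods are simply summed, as if each word occurred independently of the others given the class.

```python
import math
from collections import Counter, defaultdict


def train(examples, alpha=1.0):
    """Count documents per class and words per class.

    examples: iterable of (tokens, label) pairs.
    alpha: additive smoothing constant (1.0 gives add-one smoothing).
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, tokens):
    """Return the class with the highest log-posterior score."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for tok in tokens:
            if tok not in model["vocab"]:
                continue  # ignore words never seen in training
            # Independence assumption: each word contributes its own term.
            score += math.log((counts[tok] + alpha)
                              / (total_words + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Made-up training tweets, already tokenized.
    data = [
        (["love", "beach"], "positive"),
        (["great", "view"], "positive"),
        (["hate", "crowd"], "negative"),
        (["bad", "traffic"], "negative"),
    ]
    model = train(data)
    print(predict(model, ["love", "view"]))
```

Because the score treats words separately, a phrase such as "not good" is scored from "not" and "good" in isolation; classifiers that model word interactions are one way to address the limitation noted above.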
