SENTIMENT ANALYSIS OF BIG SOCIAL DATA WITH APACHE HADOOP
A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE
UNIVERSITY OF HAWAI‘I AT MĀNOA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
ELECTRICAL ENGINEERING
DECEMBER 2014
By
Qiuling Kang
Thesis Committee:
Xiangrong Zhou, Chairperson
Galen Sasaki
Rui Zhang
Keywords: Twitter, Sentiment Analysis, Hadoop MapReduce, HDFS
ACKNOWLEDGMENTS
I would like to express my appreciation to my advisor, Professor Zhou, for his patience and help
throughout my master's program. Thanks to his generous and valuable suggestions, I was able to
complete my master's project on time. I would also like to thank Professor Sasaki for his guidance
during my study in the EE department and for reviewing this manuscript. In addition, I would like
to thank Professor Zhang for reviewing my thesis; his insightful suggestions made it better.
ABSTRACT
Twitter is a microblog service and a very popular communication mechanism. Users of Twitter
express their interests, favorites, and sentiments towards various topics and issues they
encounter in daily life. Twitter is therefore an important online platform for people to express
their opinions, which are a key factor influencing their behavior. Thus, sentiment analysis of
Twitter data is meaningful for both individuals and organizations when making decisions. Due to
the huge amount of data generated by Twitter every day, building a system that can store and
process such big data becomes a challenge. In this study, we present a method to collect Twitter
data sets, and to store and analyze the data sets on the Hadoop platform. The experimental results
show that the presented method performs efficiently.
TABLE OF CONTENTS
Acknowledgments........................................................................................................................... ii
Abstract .......................................................................................................................................... iii
Table of Contents ........................................................................................................................... iv
List of Tables ................................................................................................................................. vi
List of Figures ............................................................................................................................... vii
Chapter 1 Introduction .................................................................................................................... 1
1.1 Background ........................................................................................................................... 1
1.2 Motivation ............................................................................................................................. 2
1.3 Contribution of the Thesis ..................................................................................................... 5
1.4 Thesis Overview .................................................................................................................... 5
Chapter 2 Twitter Data Collection and Storage .............................................................................. 6
2.1 Twitter Data Collection ......................................................................................................... 6
2.1.1 Twitter API Introduction ................................................................................................ 6
2.1.2 Tweepy Introduction and Installation ............................................................................. 8
2.1.3 Collection Procedure .................................................................................................... 12
2.2 Storage in HDFS Filesystem ............................................................................................... 17
Chapter 3 Sentiment Analysis of Tweets in Hadoop System ....................................................... 21
3.1 Algorithm Selection ............................................................................................................ 22
3.1.1 Decision Trees .............................................................................................................. 22
3.1.2 Naive Bayes Classifiers ................................................................................................ 24
3.1.3 Support Vector Machines ............................................................................................. 27
3.2 Sentiment Analysis of Tweets ............................................................................................. 30
3.2.1 Extract Feature from Tweets ........................................................................................ 30
3.2.3 Classifier ....................................................................................................................... 30
3.3 Run on Hadoop .................................................................................................................... 32
3.3.1 Hadoop MapReduce and HDFS ................................................................................... 32
3.3.2 MapReduce Functions .................................................................................................. 34
Chapter 4 Experiments and Results .............................................................................................. 38
4.1 Scottish Independence Vote Analysis ................................................................................. 38
4.2 Some Hawai‘i Tourism Sites Analysis................................................................................ 42
4.3 Performance Environment ................................................................................................... 46
Chapter 5 Conclusion and Open Issues ........................................................................................ 48
References ..................................................................................................................................... 49
LIST OF TABLES
Table 1.1 Traditional RDBMS compared to Hadoop ......................................................... 5
Table 3.1 Words with sentiment polarity in tweet ............................................................ 25
Table 4.1 The environment of the experiment .................................................................. 47
LIST OF FIGURES
Figure 1.1 The increasing trends of tweets in recent years ................................................. 1
Figure 2.1 the process of requesting from REST APIs ....................................................... 7
Figure 2.2 Streaming APIs working procedure .................................................................. 8
Figure 2.3 Twitter dataset collection procedure ................................................................. 9
Figure 2.4 Create a Twitter application ............................................................................ 10
Figure 2.5 Create a Twitter application ............................................................................ 11
Figure 2.6 Obtain token .................................................................................................... 12
Figure 2.7 The procedure of data collection ..................................................................... 13
Figure 2.8 Sample of collected files ................................................................................. 15
Figure 2.9 Architecture of HDFS...................................................................................... 18
Figure 2.10 HDFS workflow ............................................................................................ 18
Figure 3.1 The process of sentiment analysis ................................................................... 21
Figure 3.2 An example of a decision tree. ........................................................................ 23
Figure 3.3 Naive Bayes classifier represented by graph ................................................... 27
Figure 3.4 Example of SVM maximum margin and margin ............................................ 28
Figure 3.5 The workflow of opinion mining .................................................................... 32
Figure 3.6 The Hadoop MapReduce and HDFS architecture ........................................... 33
Figure 3.7 A client submit a job to MapReduce ............................................................... 34
Figure 3.8 The Mapper flowchart ..................................................................................... 36
Figure 3.9 The reducer flowchart ...................................................................................... 37
Figure 4.1 Data analysis process ....................................................................................... 38
Figure 4.2 The curve for tweets based on different keywords .......................................... 39
Figure 4.3 Scottish independence vote polarity values ..................................................... 40
Figure 4.4 The attitude distribution based on different keywords .................................... 41
Figure 4.5 Values of positive vs negative ......................................................................... 42
Figure 4.6 Distribution for attitude polarity based on keyword “Hawaii” ....................... 43
Figure 4.7 Distribution for attitude polarity based on keyword “Waikiki” ...................... 44
Figure 4.8 Distribution for attitude polarity based on keyword “Diamond head” ............ 45
Figure 4.9 Distribution for attitude polarity based on keyword “Hanauma bay” ............. 46
CHAPTER 1 INTRODUCTION
1.1 Background
Microblogging websites have become one of the major sources of information. Twitter is one
such popular microblogging service: an online social networking platform that allows people
to publish messages expressing their interests, favorites, opinions, and sentiments towards
various topics and issues they encounter in their daily life. The messages are called tweets;
they are real-time and at most 140 characters each [1]. About 200 billion tweets per year,
500 million tweets per day, 350,000 tweets per minute, or 6,000 tweets per second are
published [2]. Figure 1.1 shows the increasing trend of tweets in recent years. Such a huge
amount of data can be efficiently used for social network studies and analysis to gain useful
and meaningful results.
Figure 1.1 The increasing trends of tweets in recent years [2]
There is previous research on sentiment analysis of Twitter data. Pak and Paroubek (2010)
performed linguistic analysis of collected tweets and showed how to build a sentiment
classifier using training data [4]. Sitaram Asur and Bernardo A. Huberman (2010) showed that
social media content can be utilized to predict real-world performance; they built a linear
regression model for forecasting the box-office revenues of movies [5]. Apoorv Agarwal and
Boyi Xie (2011) introduced POS-specific prior polarity features and designed a new tree
representation for the tree kernel based model [6]. Hsiang Hui Lek and Danny C.C. Poo (2013)
proposed an aspect-based sentiment classification approach which improves existing
tweet-level classifiers [7].
Classification techniques are fundamental to analyzing the sentiment of social data. S. B.
Kotsiantis (2007) reviewed recent classification techniques, discussing and comparing several
supervised learning algorithms: logic-based algorithms such as decision trees and rule-based
algorithms; perceptron-based techniques, such as single-layer and multilayer perceptrons;
Radial Basis Function (RBF) networks; statistical learning algorithms such as the Naive Bayes
classifier and Bayesian networks; instance-based learning; and Support Vector Machines [3].
1.2 Motivation
Since the number of internet users on social networking platforms and services is growing fast,
more and more data from these platforms can be used for data mining studies. For example, a
government may be interested in people's attitudes toward a vote. It may want to predict the
vote result, answering questions like the following [4]:
1. Could the new policy get the support of most people?
2. How positive (or negative) are people about the new policy?
3. What kind of people have the most influence on the result?
Also, a local tourism company may be interested in which places are popular among tourists.
It may want to know the answers to the following questions [4]:
1. Which are the most visited tourist attractions?
2. How positive (or negative) are people about a tourist attraction?
3. At what time of day do people prefer to travel?
In this thesis, we show how we use datasets from a microblogging platform to do data mining.
We collected big datasets from the Twitter database and analyzed them. There are several
reasons we use Twitter data sets for opinion mining [4]:
1. Valuable data source: Twitter is a social network platform used by a wide range of people to
post their opinions on various topics and discuss current issues.
2. Sufficient data: the volume of tweets grows at a high rate, so sufficient data can be
gathered for data mining.
3. Variety of users: internet users come from a variety of groups, for example, researchers,
politicians, students, farmers, workers, and artists.
We collected more than one million tweets published on Twitter. They are separated into two
sets:
1. Two thirds of the tweets concern the Scottish independence vote held in September 2014.
2. One third of the tweets talk about tourism in Hawai‘i, U.S.
As we are going to process a huge amount of data, the problem of how to store large datasets
and improve processing performance is significant and cannot be ignored. There are several
reasons why we use the Hadoop system instead of a Relational Database Management System
(RDBMS) in this study:
Access speed [8][9]
Although the storage capacity of a single disk has grown considerably over the years, the
speed of reading data from the disk has not kept up, so it usually takes a long time to read
data from a hard drive. However, the MapReduce model built on Hadoop is effective when
unstructured data from different nodes are combined for merging and sorting.
Data duplication [8][9]
It is necessary to duplicate data to distinct storage systems to avoid the problems brought by
hardware failure. However, we cannot use an RDBMS with many disks to do large-scale
computation: when seek time dominates data access time, reading and writing large portions of
a dataset takes longer than streaming through it. That is to say, if we update large portions
of the database, the RDBMS works less efficiently than MapReduce built on Hadoop. The Hadoop
Distributed File System (HDFS) stores data on distributed systems which duplicate data sets
across a cluster of hard drives, so that a single disk failure does not cause data loss.
Linear Scalability [8][9]
When gigabytes of structured data are processed, an RDBMS needs to be highly integrated.
However, Hadoop can store very large datasets and process petabytes of structured,
semi-structured, or unstructured data with linear scaling and low integrity requirements. If
we increase the number of cluster nodes, the speed of processing data increases
proportionally; the same does not hold for SQL queries. Table 1.1 shows the comparison
between a traditional RDBMS and Hadoop.
We show how to collect datasets from the Twitter database via the Twitter API and perform
sentiment analysis of the collected datasets on the Hadoop distributed system.
Table 1.1 Traditional RDBMS compared to Hadoop [9]
1.3 Contribution of the Thesis
The thesis presents a method to collect a huge amount of data concerning specific topics from
the Twitter database via the Twitter API. This study extracts features from tweets and uses a
sentiment classifier to classify the tweets into positive-attitude and negative-attitude
classes in order to analyze people's opinions toward specific topics and issues. This study
stores and analyzes the datasets using HDFS and the Hadoop MapReduce model respectively, which
are more scalable and efficient than a traditional RDBMS. The experimental results show that
the presented method performs efficiently.
1.4 Thesis Overview
The rest of the thesis is organized as follows. Chapter 2 presents the Twitter data set
collection and storage. Chapter 3 introduces sentiment analysis of tweets on the Hadoop
system. Chapter 4 describes the experiments and results. Finally, Chapter 5 concludes the
thesis and indicates future work.
CHAPTER 2 TWITTER DATA COLLECTION AND
STORAGE
Millions of tweets are generated by Twitter users per day [4]. Through the Twitter API
(Application Programming Interface), researchers and developers can collect a large public
data set from the Twitter database. Twitter provides two types of APIs for users to access
Twitter data: REST APIs and Streaming APIs [22]. With the REST APIs, users must request
information explicitly; these APIs give access to some of the core primitives of Twitter,
including timelines, status updates, and user information. With the Streaming APIs, users can
continuously collect a stream of public information: real-time data in large quantities,
filtered by tracked keywords, geographic area, user, or a random sample. As long as a
long-lived connection is maintained, users get a continuous stream of updates. In this
research, we retrieve Twitter data via the Streaming APIs, based on our research objective of
analyzing user sentiments about given topics, which requires collecting Twitter messages
published by users. We need to use the authentication method supported by Twitter to make
calls to Twitter's APIs. Twitter uses OAuth (Open Authentication), an open standard for
authenticating access to protected information. After obtaining the data set, we store the
data in HDFS.
2.1 Twitter Data Collection
2.1.1 Twitter API Introduction
Twitter has two types of APIs for users to access Twitter data: REST APIs and Streaming APIs.
The REST APIs do not require users to keep a persistent HTTP (Hypertext Transfer Protocol)
connection open. A user makes one or more requests to a web
application, and then the user will receive the results to the user's initial request. Figure
2.1 shows the process of requesting from REST APIs [14].
Figure 2.1 The process of requesting from REST APIs [14]
Twitter offers three types of endpoints for Streaming APIs: Public streams are public data
flowing with public tweets; User streams are single-user streams corresponding to the view of
a single user; Site streams are multi-user streams intended for servers accessing many
Twitter users. In this research, we collect real-time tweets via the Twitter Streaming APIs.
Figure 2.2 shows how the Streaming APIs work.
Before the result is stored into a data store, the Tweets input as a streaming process are
parsed, filtered and/or aggregated first. To respond to user requests, the HTTP queries
results from the data store [14].
Figure 2.2 Streaming APIs working procedure [14]
2.1.2 Tweepy Introduction and Installation
Tweepy is open-sourced and provides access to the documented Twitter API by using Python. It
supports accessing Twitter through OAuth, which is the only way adopted by Twitter to secure
its information. OAuth offers several benefits: (1) it can make the user's information more
secure; (2) it conceals the user's password; (3) if the user changes the password, the
application will still work, since the application doesn't rely on a password; (4) the
permissions are easily managed [20][21].
In order to get data from the Streaming APIs, our application should obtain an OAuth access
token first and then install the Tweepy package, which is a Python library for accessing the
Twitter API. The Tweepy API class provides access to the Twitter API methods that accept
various parameters and return response data. Therefore a copy of the Tweepy package is
downloaded and installed on the Ubuntu Linux system. After the collection Python script
is run to collect data and store the dataset on the local server. The whole collecting
procedure is shown in Figure 2.3.
Figure 2.3 Twitter dataset collection procedure
In this thesis, the Tweepy package, which requires Python 2.5 or later, has been installed on
the Ubuntu Linux system. There are three steps to complete the installation process.
1. Download the tweepy package in local server
$ git clone git://github.com/tweepy/tweepy.git
2. Go into the tweepy directory.
$ cd tweepy
3. Install tweepy using administration or root privilege.
$ python setup.py install
In order to start the collection process, a client application is registered and a new
application is created with Twitter. We log in to the portal and then go to "My Applications".
After filling in the information shown in Figure 2.4, a new application can be created. We use
the application information to communicate with the Twitter API to retrieve data sets. As
shown in the figure, we enter the application name "For Hadoop" in the "Name" field, enter
"This application is used for test" in the "Description" field, and enter a placeholder in the
"Website" field since we do not have a URL. At last, we check "Yes, I agree" and click "Create
your Twitter application" to complete the creation process.
Figure 2.4 Create a Twitter application
After creating the application, we generate the access token. As shown in Figure 2.5, we click
"Create my access token" at the end of the form.
Figure 2.5 Create a Twitter application
Next, we get the access token and access token secret presented in Figure 2.6. All the
information we need to communicate with the Twitter API is included in the figure: owner,
owner ID, API key, and API secret.
Figure 2.6 Obtain token
2.1.3 Collection Procedure
The collection procedure is presented in Figure 2.7. It includes the following four steps:
(Figure 2.7 flowchart: Start → set OAuth authentication → set request parameters → set filter
method → access the Twitter API and collect data → End)
Figure 2.7 The procedure of data collection
1. Set OAuth authentication with tokens using Tweepy: Twitter utilizes OAuth to provide
authorized access to its API and requires all requests to use OAuth for authentication [11].
The following code shows how to use Tweepy with OAuth to access the Twitter API.
1. consumer_key = "tCYTPMwiWLXyNBdCS9Ipg"
2. consumer_secret = "7BFXcq07s5y4YrjwjP6p3t4cYu0ojeTFG9vq98rE8"
3. access_token = "37188238-Nu3991UKfyVIjacGHNnxKmBykHj5W5zX0g89kN4k"
4. access_token_secret = "KuINYWTDE1fd5QVmRlVsMBmLTDdgMoq2MnyFmo4pG7gv1"
5. auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
6. auth.set_access_token(access_token, access_token_secret)
7. api = tweepy.API(auth)
We use the consumer_key, consumer_secret, access_token, and access_token_secret shown in
Figure 2.6 to create the OAuth access. Lines 5 and 6 show how the OAuth process works. In
line 7, we create the actual interface using the authentication.
2. Set request parameters: in this thesis, a list of keywords and longitude, latitude pairs is
used to specify the Tweets that will be returned from the Twitter stream. We set the
parameters in the following format:
track = ['key words 1', 'key words 2', 'key words 3']
follow = []
geo_location = [-158.25, 21.31, -157.628, 21.71, -156.674, 20.59, -155.99, 21.017, -159.77, 21.89, -159.3, 22.24, -157.310, 21.057, -156.69, 21.21]
At least one of the three parameters track, follow, and locations should be specified.
The parameter "track" is a list of keywords to track: a comma-separated list of phrases used
to determine which Tweets will be delivered. A phrase can contain one or several words
separated by spaces and must be 60 bytes or less [10][12].
The parameter "follow" is a comma-separated list of user IDs to track. We can collect the
tweets of specific users by setting the "follow" parameter [10][12].
The parameter "locations" is a list of longitude, latitude pairs which define bounding boxes
of the geographic areas to track. All the tweets in the areas we set will be retrieved
[10][12]. For example, by setting longitude and latitude pairs that bound the Hawai‘i Islands,
we can track tweets from the Hawai‘i area.
3. Use the filter method to collect tweets matching the request parameters. We call filter()
in the following format:
stream.filter(track = track, follow = follow, locations = geo_location)
4. The tweets that match one or more filter parameters are returned and stored on the local
server. These tweets are encoded in JSON (JavaScript Object Notation), a lightweight
data-interchange format that is easy for humans to read and write and for machines to parse
and generate [13]. In this thesis, the Tweepy package is utilized to access the Streaming API
and gather the tweets encoded in JSON. Figure 2.8 presents part of the Twitter data set we
collected.
Figure 2.8 Sample of collected files
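To illustrate how the authentication, parameter, and filter steps fit together, the following
is a minimal sketch of a collection script using the classic Tweepy StreamListener interface.
The credentials, the keyword list, and the output file name tweets.json are placeholders, not
the actual values used in this thesis.

import tweepy

# Placeholder credentials; the real values come from the application page (Figure 2.6).
consumer_key = "CONSUMER_KEY"
consumer_secret = "CONSUMER_SECRET"
access_token = "ACCESS_TOKEN"
access_token_secret = "ACCESS_TOKEN_SECRET"

class SaveListener(tweepy.StreamListener):
    """Append every matched tweet, still encoded as raw JSON, to a local file."""
    def __init__(self, path="tweets.json"):
        super(SaveListener, self).__init__()
        self.out = open(path, "a")

    def on_data(self, data):
        self.out.write(data)   # one JSON-encoded tweet per call
        return True            # keep the stream connection open

    def on_error(self, status_code):
        return False           # disconnect on HTTP errors such as 420 (rate limited)

if __name__ == "__main__":
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    track = ["Scotland", "Scottish", "independence"]   # example keywords
    geo_location = [-158.25, 21.31, -157.628, 21.71]   # example box: SW lon, SW lat, NE lon, NE lat
    stream = tweepy.Stream(auth, SaveListener())
    stream.filter(track=track, locations=geo_location)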
More than one million tweets have been collected from the Twitter API in this thesis. One of
the tweets is shown below:
{"created_at":"Mon Aug 11 18:50:45 +0000 2014", "id":498904369136680961, "id_str":"498904369136680961", "text":"@CrankyDad I need to visit you the next time I'm over there meeting with Jay and Sara.", "source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e", "truncated":false, "in_reply_to_status_id":498903812942209024, "in_reply_to_status_id_str":"498903812942209024", "in_reply_to_user_id":19426562, "in_reply_to_user_id_str":"19426562", "in_reply_to_screen_name":"CrankyDad", "user":{"id":15705560, "id_str":"15705560", "name":"JulieFord", "screen_name":"JulieFord808", "location":"Honolulu, Hawaii", "url":"http:\/\/www.SchweitzerConsulting.com", "description":"Local girl after 20 years in Hawaii. Mom of a crazy toddler. Owner of Schweitzer Consulting, a PR consultancy.", "protected":false, "verified":false, "followers_count":1159, "friends_count":953, "listed_count":49, "favourites_count":19,
"statuses_count":4296, "created_at":"Sat Aug 02 22:50:58 +0000 2008", "utc_offset":-36000, "time_zone":"Hawaii", "geo_enabled":true, "lang":"en", "contributors_enabled":false, "is_translator":false, "profile_background_color":"EDECE9", "profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/34245337\/twilk_background.jpg", "profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/34245337\/twilk_background.jpg", "profile_background_tile":true, "profile_link_color":"088253", "profile_sidebar_border_color":"D3D2CF", "profile_sidebar_fill_color":"E3E2DE", "profile_text_color":"634047", "profile_use_background_image":true, "profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000211101091\/04050c00d0da6c74b318be1e34f8a38d_normal.jpeg", "profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000211101091\/04050c00d0da6c74b318be1e34f8a38d_normal.jpeg", "profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/15705560\/1368656596", "default_profile":false, "default_profile_image":false, "following":null, "follow_request_sent":null, "notifications":null}, "geo":{ "type":"Point", "coordinates":[21.295619,-157.847390]}, "coordinates":{"type":"Point","coordinates":[-157.847390,21.295619]}, "place":{"id":"c47c0bc571bf5427", "url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/c47c0bc571bf5427.json", "place_type":"city", "name":"Honolulu", "full_name":"Honolulu, HI", "country_code":"US", "country":"United States", "bounding_box":{"type":"Polygon", "coordinates":[[[-157.950476,21.254837],[-157.950476,21.38505],[-157.648702,21.38505],[-157.648702,21.254837]]]}, "attributes":{}}, "contributors":null, "retweet_count":0, "favorite_count":0, "entities":{"hashtags":[], "trends":[], "urls":[], "user_mentions":[{ "screen_name":"CrankyDad", "name":"Mike Gordon", "id":19426562, "id_str":"19426562",
"indices":[0,10]}], "symbols":[]}, "favorited":false, "retweeted":false, "possibly_sensitive":false, "filter_level":"medium", "lang":"en"}
2.2 Storage in HDFS Filesystem
HDFS is a distributed file system with a master/slave architecture, built into the Hadoop
platform. It is designed for storing large files, which could be hundreds of megabytes,
gigabytes, or even petabytes in size, and for running on clusters of inexpensive commodity
hardware that need not be highly reliable [9].
Figure 2.9 shows the HDFS architecture. In a cluster, HDFS consists of a single NameNode (the
master) and a cluster of DataNodes (the slaves). The NameNode hosts the filesystem index in
the form of a namespace image and an edit log. It also knows the DataNodes on which the blocks
of each file are located; this block location information is reconstructed from the DataNodes
when the system starts. DataNodes store the data of the filesystem and retrieve blocks when
the NameNode tells them to, and they report their status to the NameNode periodically. There
is also a secondary NameNode, which produces snapshots of the primary NameNode's memory
structures to reduce the damage caused by filesystem corruption [9][14][15].
Figure 2.9 Architecture of HDFS
The HDFS cluster is set up at the beginning of the process, and then we transfer the collected
data sets from the local system to HDFS for the subsequent sentiment analysis. Figure 2.10
shows the process of storing datasets into HDFS.
Figure 2.10 HDFS workflow
1. HDFS setup: we need to install the Hadoop platform on a cluster of servers and configure
the NameNode and DataNode files. We execute the following commands to set up the system.
Java 1.6.0_30 was installed on the cluster:
hdp@hadoop:/usr/lib/jvm/default-java$ sudo chmod u+x jdk-6u30-linux-x64.bin
hdp@hadoop:/usr/lib/jvm/default-java$ sudo ./jdk-6u30-linux-x64.bin
hdp@hadoop:/usr/lib/jvm/default-java$ sudo chmod u+x jre-6u30-linux-x64.bin
hdp@hadoop:/usr/lib/jvm/default-java$ sudo ./jre-6u30-linux-x64.bin
SSH was installed on the cluster.
hdp@hadoop:~$ sudo apt-get install ssh
hdp@hadoop:~$ sudo apt-get install rsync
hdp@hadoop:~$ sudo /etc/init.d/ssh start
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service ssh start
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start ssh
hdp@hadoop:~$ ps -ef | grep sshd
root      3700     1  0 15:07 ?        00:00:00 /usr/sbin/sshd -D
hadoop    4071  2685  0 15:18 pts/1    00:00:00 grep --color=auto sshd
hdp@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
15:49:81:02:71:55:a9:a0:9d:a8:e6:4d:c1:00:ae:65 hadoop@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|.. oo...+=+      |
|. . .o . o.      |
| .Eo + + ..      |
|.o = o ..        |
|. . . S          |
|   o .           |
|    o o          |
|     . .         |
|                 |
+-----------------+
hdp@hadoop:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Install Hadoop on the cluster and configure the Hadoop configuration files:
export JAVA_HOME=/usr/local/java/jdk1.6.0_45   (in hadoop-env.sh)
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin
We configure core-site.xml, hdfs-site.xml, and mapred-site.xml respectively. The network is
set up by connecting each server via a single hub. We assign the IP address 192.168.123.104
to the master machine, and 192.168.123.118 and 192.168.123.113 to the slave machines.
2. Initialize the system: the HDFS filesystem is formatted via the NameNode. The following
command is executed:
hdp@master:/usr/local/hadoop$ bin/hadoop namenode -format
Start the system:
hdp@master:/usr/local/hadoop$ bin/start-all.sh
3. Transfer local data sets into HDFS:
hdp@master:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/inputTxt1 /user/hdp/inputTxt1
4. MapReduce calculation
hdp@master:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
> -file /home/hdp/MyMapper.py -mapper /home/hdp/MyMapper.py \
> -file /home/hdp/MyReducer.py -reducer /home/hdp/MyReducer.py \
> -input /user/hdp/inputTxt1/* -output /user/hdp/inputTxt1-output
CHAPTER 3 SENTIMENT ANALYSIS OF TWEETS
IN HADOOP SYSTEM
Sentiment analysis is the field of research that identifies and extracts subjective
information from written language. It is also called opinion mining, which aims to analyze
people's attitudes, emotions, and opinions and classify the polarity of a given text.
Sentiment analysis usually classifies the given text into two classes, positive and negative
[16]. The proposed process of sentiment analysis of Tweets is described in Figure 3.1.
Figure 3.1 The process of sentiment analysis
The first step is collecting the dataset from the Twitter database. If there is no expert who
can tell us which fields are the most informative, we could use the brute-force method of
gathering everything in the hope that the relevant features can be isolated later. However, a
dataset collected by the brute-force method may miss useful information and contain too much
noise, so we need to define the keywords of the classifier. The second step is the definition
of classifier keywords and data preparation. Keyword selection reduces the data size, removes
many irrelevant and redundant features, and thus reduces noise. Tweets filtered by keywords
are processed more effectively and faster by the data mining algorithm. In sum, a good
selection of classifier keywords contributes to better analysis results.
3.1 Algorithm Selection
It is very important to choose an appropriate algorithm for sentiment analysis. For the text
classification problem, there are three methods that can be applied: Decision Trees, Naive
Bayes classification, and Support Vector Machines (SVMs) [3]. In the following sections, we
introduce and compare them and then present the method we use.
3.1.1 Decision Trees
Decision trees are tree-like graphs that classify instances by sorting them based on attribute
values. A decision tree uses decision nodes to test the attributes of the instance to be
classified, and each branch leaving a node corresponds to one value of the attribute tested at
that node. Each leaf node of the decision tree represents a classification outcome.
Classification starts from the root node, proceeds by sorting on attribute values, and ends at
a leaf node. Figure 3.2 shows an example of a decision tree [3].
Figure 3.2 An example of a decision tree.
The decision tree method is simple to understand and easy to implement. General pseudo-code
for building a decision tree for sentiment analysis is shown as follows [3]:
Check for base cases
Create a node r for the tree
For each Tweet in Tweets do:
    If Tweet does not contain keywords, discard the Tweet.
    If Tweet contains keywords, do:
        add a new tree branch below r, corresponding to the test
        if keywords are positive then:
            label the Tweet "Positive attitude"
        Else
            add a new tree branch below, corresponding to the test
            if keywords are negative then:
                label the Tweet "Negative attitude"
            Else
                label the Tweet "Neutral attitude"
3.1.2 Naive Bayes Classifiers
Assume that there are two classes of keywords, w1 = positive and w2 = negative, and that the
set of sentiment words in a Tweet is represented as T. Define the following symbol:
p(wj | T) is the probability of class wj, given that we have observed T.
Bayesian classifiers use Bayes' theorem, which is described as follows [3]:
p(wj | T) = p(T | wj) p(wj) / p(T)
where p(wj | T) is the probability of instance T being in class wj,
p(T | wj) is the probability of generating instance T given class wj,
p(wj) is the probability of occurrence of class wj, and
p(T) is the probability of instance T occurring.
In order to classify the attitude of T as positive or negative, the probabilities p(w1 | T)
and p(w2 | T) are compared, and the class with the larger probability is taken as the more
likely sentiment. We input the n sentiment words in a Tweet as T = {t1, t2, …, tn}. When ti is
a positive word, ti equals 1, and when ti is a negative word, ti equals 2.
We assume all ti are probabilistically independent and that there are k positive words in T,
so the following hold:
p(w1) = p(w2) = 0.5
p(ti = 1 | w1) >> p(ti = 2 | w1)
p(ti = 2 | w2) >> p(ti = 1 | w2)
p(ti = 1 | w1) = p(ti = 2 | w2) = p >> 0.5
Since the ti are independent,
p(T | w1) = p(ti = 1 | w1)^k · p(ti = 2 | w1)^(n-k) = p^k (1 - p)^(n-k)
Similarly,
p(T | w2) = p^(n-k) (1 - p)^k
Thus,
p(w1 | T) / p(w2 | T) = [p(T | w1) p(w1)] / [p(T | w2) p(w2)] = (p / (1 - p))^(2k-n)
In sum, the classifier result depends on the number of positive words and negative words. For
example, consider an input Tweet: "Lovely turtle, beautiful fish, and bad weather, but still a
fancy trip." The sentiment polarity of the words in this tweet is shown in the following table.
Table 3.1 Words with sentiment polarity in tweet
Since the tweet contains three positive words and one negative word, applying the Naive Bayes
rule gives
p(positive | Tweet) = 0.273
p(negative | Tweet) = 0.091
Because p(positive | Tweet) is larger than p(negative | Tweet), we can deduce that p(w1 | T)
is larger than p(w2 | T). Therefore, the sampled Tweet is labeled as positive. As the Naive
Bayes classifier assumes the attributes are independent, the estimate is:
p(T | wj) = p(t1 | wj) p(t2 | wj) ⋯ p(tn | wj)
The Naive Bayes classifiers can be represented as directed acyclic graphs which have one
unobserved node as parent and several observed nodes as children with strong
independence assumptions among them [3].
Figure 3.3 Naive Bayes classifier represented by graph
General pseudo-code for the Naive Bayes classifier for sentiment analysis is shown as follows
[3]:
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
Calculate the error rate of the rules
Choose the rules with the smallest error rate
3.1.3 Support Vector Machines
Support vector machines (SVMs) are supervised machine learning models which can be used for
data analysis and classification. An SVM constructs a hyperplane that can be used for
classification. To achieve the best classification performance, we need to find the maximum
margin, which means that on either side of the hyperplane the distance to the nearest data
point is as large as possible; this reduces an upper bound on the expected generalization
error [3].
Suppose we are given linearly separable data points which can be separated into two classes by
a hyperplane. There may be many hyperplanes that can classify the data points into two
classes. One reasonable choice is the hyperplane that represents the largest separation, i.e.
the maximum margin. Figure 3.4 shows the maximum-margin hyperplane and margins for an SVM.
Figure 3.4 Example of SVM maximum margin and margin
If the data sets are linearly separable, we select two parallel hyperplanes between which the
distance is maximized. The area bounded by them is the margin, in which no data points are
located. Therefore, there is a pair (w, b) that satisfies the following inequalities [3]:
w · xi + b ≥ +1, for yi = +1
w · xi + b ≤ -1, for yi = -1
where xi is an n-dimensional data vector, yi ∈ {+1, -1} is its class label,
w is the normal vector to the hyperplane, and
b is the offset.
The two constraints can be rewritten as:
yi (w · xi + b) ≥ 1
When we linearly classify the two classes, the best hyperplane can be found by solving the
quadratic programming optimization problem:
minimize over (w, b):  (1/2) ‖w‖²
subject to
yi (w · xi + b) ≥ 1, for i = 1, …, n
The data points lying on the margin and satisfying yi (w · xi + b) = 1 are the support
vectors, whose linear combination represents the solution (see Figure 3.4).
General pseudo-code for SVMs is illustrated in the following process [3].
1) Introduce positive Lagrange multipliers αi, one for each of the inequality constraints
above. This gives the Lagrangian:
LP = (1/2) ‖w‖² - Σi αi [ yi (w · xi + b) - 1 ]
2) Minimize LP with respect to w and b.
3) Compute the quadratic programming solution w, b.
4) In the solution, those points for which αi > 0 are called "support vectors".
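As an illustration only (this thesis does not use SVMs or scikit-learn), a linear SVM can be
fit to a toy data set to recover the hyperplane parameters w, b and the support vectors
described above; availability of scikit-learn is an assumption here.

import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: three points per class.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a large C approximates the hard-margin problem
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])   # hyperplane parameters
print("support vectors:", clf.support_vectors_)        # the points with alpha > 0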
3.2 Sentiment Analysis of Tweets
3.2.1 Extract Feature from Tweets
We extract features from the tweets collected by the Python program for the subsequent
sentiment classification. The process of parsing a Tweet post and obtaining unigrams is as
follows:
Decode: the datasets collected from the API come in JSON, so they are decoded into Python data
structures for further processing (e.g. JSON: [{"text": "tweet", "truncated": false,
"test": [6,14]}], Python: [{u'text': u'tweet', u'truncated': False, u'test': [6, 14]}]).
Filtering: we extract the text element (tweet content) from the tweet, which is now a Python
data structure (e.g. "Everyone in Hawai‘i is so nice."), and then convert the text into lower
case (e.g. "everyone in hawai‘i is so nice.").
Tokenization: we parse the data by splitting it on spaces. We encode the text in UTF-8 to get
rid of Unicode errors and strip the punctuation from the text.
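A minimal sketch of these three steps in Python is shown below; the punctuation handling here
is a simplified assumption rather than the exact logic of the thesis's script.

import json
import string

def extract_unigrams(raw_line):
    """Decode one JSON-encoded tweet, keep its text field, lower-case it,
    strip punctuation, and split on spaces to obtain unigrams."""
    tweet = json.loads(raw_line)                  # JSON -> Python data structure
    text = tweet.get("text", "").lower()          # keep only the tweet content, lower-cased
    for ch in string.punctuation:                 # replace punctuation with spaces
        text = text.replace(ch, " ")
    return text.split()

sample = '{"text": "Everyone in Hawaii is so nice.", "truncated": false}'
print(extract_unigrams(sample))   # ['everyone', 'in', 'hawaii', 'is', 'so', 'nice']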
3.2.3 Classifier
Since Naive Bayes is fast, space efficient, and not sensitive to irrelevant features, we use
the Naive Bayes classifier, which is based on Bayes' theorem (Anthony J, 2007), in this study:
p(w | T) = p(T | w) · p(w) / p(T)
where w is a sentiment class and T is a Twitter message [3].
Bayes' theorem is applied here with strong independence assumptions. Therefore, the
probabilistic model for the classifier can be described as:
P(positive | T) = P(positive) ∏i p(ti | positive) / P(T)
P(negative | T) = P(negative) ∏i p(ti | negative) / P(T)
R = P(positive | T) / P(negative | T)
  = [P(positive) ∏i p(ti | positive)] / [P(negative) ∏i p(ti | negative)]
Comparing the probabilities P(positive | T) and P(negative | T), the larger probability
indicates the class label that is more likely to be the actual label. If R is larger than 1,
then a positive attitude is predicted; otherwise, a negative attitude is predicted.
During the sentiment analysis, the Naive Bayes classifier classifies a Tweet into the positive
class or the negative class by examining the words in the Tweet. Each word is labeled
"positive" or "negative" according to the lexicon. In the Naive Bayes classification, the
number of sentiment words is counted: if more positive words than negative words appear in a
Tweet, the Tweet is labeled as positive; if fewer positive words than negative words appear,
the Tweet is labeled as negative. Neutral words are ignored in this study since they contain
no valuable information for sentiment analysis.
The algorithm judges the polarity of the text in the Tweet by checking the words in the Tweet
and, at last, outputs the individual's view. Figure 3.5 shows the workflow of sentiment
analysis [17].
Figure 3.5 The workflow of opinion mining
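Concretely, a minimal sketch of this counting-based classification step is shown below; the
two word sets stand in for the sentiment lexicon and are assumptions, not the lexicon actually
used in this thesis.

POSITIVE_WORDS = {"lovely", "beautiful", "fancy", "nice", "good"}   # stand-in lexicon
NEGATIVE_WORDS = {"bad", "awful", "terrible", "sad"}

def classify(tweet_text):
    """Label a tweet positive, negative, or neutral by counting lexicon words."""
    words = [w.strip(".,!?") for w in tweet_text.lower().split()]
    pos = sum(1 for w in words if w in POSITIVE_WORDS)
    neg = sum(1 for w in words if w in NEGATIVE_WORDS)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(classify("Lovely turtle, beautiful fish, and bad weather, but still a fancy trip."))
# -> positive (three positive words vs. one negative word)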
3.3 Run on Hadoop
3.3.1 Hadoop MapReduce and HDFS
Hadoop has a master-slave architecture which consists of HDFS and MapReduce [14]. The big
Twitter datasets are stored in HDFS, from which the data is read for processing, and the
computation is done by MapReduce [18].
The MapReduce master is responsible for organizing where computational work should be
scheduled on the slave nodes. The HDFS master is responsible for partitioning the storage
across the slave nodes and keeping track of where data is located [18]. Figure 3.6 shows the
Hadoop MapReduce and HDFS architecture.
Figure 3.6 The Hadoop MapReduce and HDFS architecture
MapReduce breaks the sentiment analysis processing into a map phase and a reduce phase, which
are executed by MyMapper.py and MyReducer.py respectively. The map phase outputs key-value
pairs. After being sorted by the Unix built-in sort program, the key-value pairs are processed
by the reduce phase, which then writes out the results to be stored in HDFS.
The following three steps and Figure 3.7 describe the MapReduce process:
1. Map process: the datasets are split into records and mapped to distinct keys and values.
2. Shuffle and sort process: the key-value pairs are shuffled and sorted by key into a logical
order.
3. Reduce process: the data flowing into the reduce process, which is the output of the
previous step, is grouped by key and has the reduce function applied to it.
Figure 3.7 A client submits a job to MapReduce [18]
3.3.2 MapReduce Functions
Hadoop Streaming, provided with the Hadoop distribution, is a utility that allows us to create
and run Map/Reduce jobs with Python scripts [23]. It handles passing data between our map and
reduce functions. Since it uses standard input and standard output, we write our map and
reduce functions in Python, read input data using Python's sys.stdin, and print output data
using Python's sys.stdout [9].
The script MyMapper.py reads data from STDIN, splits it into words, and passes key-value pairs
line by line to STDOUT. The key-value pairs output by the map script are not sorted; the
intermediate sorting is done by the sort program built into UNIX-based systems. After being
sorted by key, the output key-value pairs are read line by line by the MyReducer.py script
through standard input (STDIN), which writes its final result to standard output (STDOUT) [9].
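As an illustration of this flow, a minimal Hadoop Streaming pair in the spirit of MyMapper.py
and MyReducer.py might look like the following; the lexicon and the exact key format are
assumptions, not the thesis's actual scripts. The pair can also be tested locally by piping
sample data through the mapper, the Unix sort command, and the reducer.

#!/usr/bin/env python
# MyMapper.py (sketch): read raw JSON tweets from STDIN, emit "<label><TAB>1" per tweet.
import json
import sys

POSITIVE = {"lovely", "beautiful", "fancy", "nice"}   # stand-in lexicon
NEGATIVE = {"bad", "awful", "terrible"}

for line in sys.stdin:
    try:
        text = json.loads(line).get("text", "").lower()
    except ValueError:
        continue                                      # skip malformed lines
    words = text.split()
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    label = "positive" if pos > neg else ("negative" if neg > pos else "neutral")
    print("%s\t1" % label)

#!/usr/bin/env python
# MyReducer.py (sketch): STDIN is already sorted by key, so counts for each label are contiguous.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))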
1. Mapper: Figure 3.8 shows the mapper flowchart.
Figure 3.8 The Mapper flowchart
2. Reducer: Figure 3.9 shows the reducer flowchart.
Figure 3.9 The reducer flowchart
CHAPTER 4 EXPERIMENTS AND RESULTS
In this chapter, we present experiments and results for two classification tasks: sentiment
classification for the Scottish independence vote (positive vs. negative) and sentiment
classification for Hawai‘i tourism spots (positive vs. negative). For each sentiment
classification task, we follow the procedure described in Figure 4.1 below, and the Naive
Bayes classifier is applied to classify the datasets into positive and negative classes.
Figure 4.1 Data analysis process
4.1 Scottish Independence Vote Analysis
The Scottish independence vote was a referendum on Scottish independence which took
place in Scotland on 18 September 2014. The voters answered “Should Scotland be an
independent country?” with “Yes” or “No” to decide whether Scotland should be
independent [19].
We extracted tweets from Twitter for opinion mining to predict the result of the vote. To make
sure all the data we collected from Twitter refer to the Scottish independence vote, we used
keywords concerning the event as search arguments. We extracted tweets via the Twitter
Streaming API at frequent intervals, so we had the timestamp, author, and tweet text for
opinion evaluation. About one million tweets were gathered over a period of ten days around
the Scottish independence vote date. Since the polling took place on 18 September, we
extracted tweets from 11 September to 20 September for sentiment analysis.
Figure 4.2 The curve for tweets over collecting period based on different keywords
Figure 4.2 shows the time-series trend in the number of tweets about the Scottish polling, for
the tracked keywords Scotland, Scottish, Vote Independence, and Independent, over the
collection period. We can observe that the busiest time is September 18, 2014, which is
reasonable since the vote happened on that day. After September 18, fewer and fewer people
discussed the event because the polling had ended, so fewer and fewer tweets concerning the
topic could be collected and the curves come down.
Figure 4.3 Scottish independence vote polarity values
Figure 4.3 displays the number of tweets expressing each attitude over time. As we can read
from the figure, the largest number of tweets about the topic was published around the vote
date, and many Twitter users hold a neutral attitude compared to the positive and negative
attitudes.
Figure 4.4 The attitude distribution based on different keywords
Pie charts A, B, C, D, and E of Figure 4.4 show the attitude distribution based on the
different keywords.
Figure 4.5 Distribution of authors’ political standpoint toward Scottish Independence
vote: values of positive vs negative
Figure 4.5 shows the authors' political standpoint toward the Scottish independence vote. The
x axis represents the time period over the ten days, 240 hours in total. The y axis represents
the degree of the authors' attitude. The blue points represent positive values, standing for
support of Scottish independence, while negative values are marked by red points, standing for
opposition to Scottish independence. As we can observe, the peak appeared on September 18.
4.2 Some Hawai‘i Tourism Sites Analysis
Hawai‘i islands which are Hawai‘i, O‘ahu, Maui, Kaua‘i, and Lāna‘i are located in the
Pacific Ocean and have significant tourism [24]. In 2013, according to Hawai‘i
government 2013 annual report, there were over 8 million visitors to the Hawaiian
Islands with expenditures of over $15 billion [25]. The most popular times for tourist are
e summer months and major holidays, therefore, our tweets collecting period is from
August 23 to September 20, 2014. In this study, we mainly collect data concerning these
topics: Hawai‘i—the name of the islands, Waikīkī—well known for Waikīkī beach which
is the most popular beach on O‘ahu); Diamond Head (a volcanic cone and a major tourist
attraction on O‘ahu); Hanauma Bay (famous for snorkeling); and Hawaiian Airlines (the largest
airline in Hawai‘i) [26][27][28][29].
Figure 4.6 shows the distribution of attitude polarity from August 23 to September 20 based on
the keyword "Hawai‘i". The blue circles in the figure represent the positive attitude degree,
the red stars represent the negative attitude degree, and the pink points represent the
average of positive and negative. As we can observe, all the average points are above the zero
line over the collection period; thus, Twitter authors have positive comments on Hawai‘i.
Figure 4.6 Distribution for attitude polarity over collecting period based on keyword
“Hawai‘i”
Figure 4.7 shows the distribution of attitude polarity from August 23 to September 20 based on
the keyword "Waikīkī". We can observe that the pink points, which are the average of the
positive and negative values, are above the zero line. Therefore, from the figure, we can
conclude that authors have a positive attitude toward Waikīkī.
Figure 4.7 Distribution for attitude polarity over collecting period based on keyword
“Waikīkī”
Figure 4.8 describes the distribution of attitude polarity from August 23 to September 20
based on the keyword "Diamond head". People's average attitude toward Diamond Head is
positive, since the average values represented by the pink points are above the zero line.
Figure 4.8 Distribution for attitude polarity over collecting period based on keyword
“Diamond head”
Figure 4.9 shows the distribution of attitude polarity from August 23 to September 20 based on
the keyword "Hanauma bay". The average values are above the zero line; thus, authors have
positive comments on Hanauma Bay.
Figure 4.9 Distribution for attitude polarity over collecting period based on keyword
“Hanauma bay”
4.3 Performance Environment
In this experiment, we use one server as the master and two servers as slaves, on which we
installed the Ubuntu system and Hadoop. The environment of the experiment is described in
Table 4.1 below.
Table 4.1 The environment of the experiment
CHAPTER 5 CONCLUSION AND OPEN ISSUES
This study presents a method to collect datasets concerning specific topics from the Twitter
database via the Twitter API. We extract features from tweets and use a Naive Bayes classifier
to separate the data into two classes, positive and negative, for opinion evaluation toward
those topics and issues. In this study, we store the original dataset in the HDFS filesystem
and analyze the datasets using the Hadoop MapReduce model. We visualized the analysis results
using Matlab. The experimental results show that the presented method performs efficiently.
Although this thesis evaluates the views of Twitter authors, predicting the Scottish
independence vote result and analyzing tourists' attitudes toward some popular tourist
attractions in Hawai‘i, there are many open issues that still require further investigation
and research. Some of the open issues worth attention in relation to this work are discussed
here. This thesis uses a Naive Bayes classifier for classification; in future work, we may
modify it to improve its performance or try other classifiers to overcome the independence
assumption.
REFERENCES
[1] Matthew A. Russell, Mining the Social Web, O’Reilly, 2011
[2] http://www.internetlivestats.com/twitter-statistics
[3] S. B. Kotsiantis, Supervised Machine Learning: A Review of Classification
Techniques, Proceedings of the 2007 conference on Emerging Artificial Intelligence
Applications in Computer Engineering: Real Word AI Systems with Applications in
eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3-24, June 2007.
[4] Alexander Pak, Patrick Paroubek, Twitter as a Corpus for sentiment analysis and
opinion mining, LREC 2010, Seventh International Conference on Language Resources
and Evaluation, May 2010.
[5] Sitaram Asur, Bernardo A. Huberman, Predicting the Future with Social Media, 2010
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
(WI-IAT), vol. 1, pp. 492-499, 2010.
[6] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, Rebecca Passonneau,
Sentiment Analysis of Twitter Data, LSM '11 Proceedings of the Workshop on
Languages in Social Media, pp.30-38, June 2011.
[7] Hsiang Hui Lek and Poo, D.C.C., Aspect-based Twitter Sentiment Classification,
2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI),
pp.366-373, November 2013
[8] http://bigdatanerd.wordpress.com/2012/02/12/hadoop-vs-rdbms-where-hadoop-cores-
over-rdbms/
[9] Tom White, Hadoop: The Definitive Guide, Third Edition, O’Reilly, 2011.
[10] https://dev.twitter.com/docs/api/1.1/post/statuses/filter
[11] https://dev.twitter.com/docs/auth/oauth
[12] https://dev.twitter.com/docs/streaming-apis/parameters#track
[13] http://json.org/
[14] http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[15] http://en.wikipedia.org/wiki/Apache_Hadoop
[16] Bing Liu, Sentiment Analysis and Opinion Mining, Graeme Hirst, 2012.
[17] Shamanth Kumar, Fred Morstatter, Huan Liu, Twitter Data Analytics, Springer,
2013
[18]. Alex Holmes, Hadoop in Practice, Manning Shelter Island, 2012
[19] http://en.wikipedia.org/wiki/Scottish_independence_referendum,_2014
[20] http://www.pythoncentral.io/introduction-to-tweepy-twitter-for-python/
[21] http://tweepy.readthedocs.org/en/v2.3.0/getting_started.html
[22] https://dev.twitter.com/overview/documentation
[23] http://hadoop.apache.org/docs/r1.2.1/streaming.html#Hadoop+Streaming
[24] http://en.wikipedia.org/wiki/Tourism_in_Hawaii
[25] http://dbedt.hawaii.gov/visitor/
[26] http://en.wikipedia.org/wiki/Waikiki
[27] http://en.wikipedia.org/wiki/Diamond_Head,_Hawaii
[28] http://www.tripadvisor.com/
[29] http://www.yelp.com/